1. Introduction
In this paper, we consider the following structured convex optimization problem
$$\min_{x\in\mathbb{R}^n} \; F(x) := f(x) + h(x), \qquad\qquad (1)$$
where $f$ and $h$ are closed proper convex functions, but not necessarily differentiable, and $\operatorname{dom} h := \{x\in\mathbb{R}^n : h(x) < +\infty\}$ is the effective domain of h. Problems of this type frequently arise in practice, for example in compressed sensing [1], image reconstruction [2], machine learning [3], optimal control [4] and power systems [5,6,7,8]. The following are three representative examples.
Example 1 ($\ell_1$ minimization in compressed sensing). The signal recovery problems in compressed sensing [1] usually take the following form
$$\min_{x\in\mathbb{R}^n} \; \tfrac{1}{2}\|Ax-b\|^2 + \mu\|x\|_1, \qquad\qquad (2)$$
where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^m$ and $\mu>0$, which aims to get a sparse solution x of the linear system $Ax=b$. Note that by defining $f(x)=\tfrac{1}{2}\|Ax-b\|^2$ and $h(x)=\mu\|x\|_1$, (2) is of the form of (1).

Example 2 (Regularized risk minimization). At the core of many machine learning problems is to minimize a regularized risk function [9,10]
$$\min_{x\in\mathbb{R}^n} \; \frac{1}{N}\sum_{i=1}^{N} l(u_i,v_i;x) + r(x), \qquad\qquad (3)$$
where the first term is the empirical risk, $\{(u_i,v_i)\}_{i=1}^{N}$ is a training set, and l is a convex loss function measuring the gap between $v_i$ and the predicted value generated by using x. In general, the empirical risk is a nondifferentiable and computationally expensive convex function, whereas the regularization term $r$ is a simple convex function, say $\|x\|_1$ or $\|x\|_2^2$. By defining f to be the empirical risk and h to be the regularization term, (3) is also of the form of (1).

Example 3 (Unconstrained transformation of a constrained problem). Consider a constrained problem
$$\min \; \{\, f(x) : x\in X \,\}, \qquad\qquad (4)$$
where f is a convex function and $X\subseteq\mathbb{R}^n$ is a convex set. By introducing the indicator function $i_X$ of X, that is, $i_X$ equals 0 on X and infinity elsewhere, problem (4) can be written equivalently as
$$\min_{x\in\mathbb{R}^n} \; f(x) + i_X(x). \qquad\qquad (5)$$
Clearly, by setting $h=i_X$, (5) becomes the form of (1). We note that such a transformation can be very efficient in practice if the set X has some special structure [11,12].
The design of methods for solving problems of the form (1) has attracted the attention of many researchers. We mention here four classes of these methods: operator splitting methods [13,14,15], alternating direction methods of multipliers [5,16,17,18,19], alternating linearization methods [20,21], and the alternating linearization bundle method [22]. They all fall within the well-known class of first-order black-box methods; that is, it is assumed that there is an oracle that can return the (approximate) function value and one arbitrary (approximate) subgradient at any given point. Regarding the above methods, we are concerned about the following three points:
However, for some practical problems, the functions may be nondifferentiable (nonsmooth), not easy to handle, and computationally expensive, in the sense that the exact first-order information may be impossible to calculate, or be computable but difficult to obtain. For example, if f has the form $f(x)=\sup_{u\in U} f_u(x)$, where U is an infinite set and $f_u(\cdot)$ is convex for any $u\in U$, then it is often difficult to calculate the exact function value $f(x)$. But for any tolerance $\varepsilon>0$, we may usually find a lower approximation $f_{u_\varepsilon}$ in finite time such that $f_{u_\varepsilon}(\cdot)\le f(\cdot)$ and $f(x)-f_{u_\varepsilon}(x)\le\varepsilon$ for some $u_\varepsilon\in U$. Then we can take a subgradient of $f_{u_\varepsilon}$ at x as an approximate subgradient of f at x. Another example is two-stage stochastic programming (see, e.g., References [23,24]), in which the function value is generated after solving a series of linear programs (details will be given in the section on numerical experiments); therefore its accuracy is determined by the tolerance of the linear programming solver. Some other practical examples, such as Lagrangian relaxation, chance-constrained programs and convex composite functions, can be found in Reference [25].
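The following short sketch illustrates how such an approximate oracle can arise for a sup-type function; the concrete family $f_u$ below (affine functions indexed by an angle, so that the true function is $f(x)=\|x\|-1$) is an invented toy example, not one of the applications discussed above:

```python
import numpy as np

# Illustrative inexact oracle for a sup-type function f(x) = sup_{u in U} f_u(x)
# with an infinite index set U: only finitely many pieces are evaluated, which
# yields a lower estimate of f(x) together with a subgradient of the best piece
# found.  The family f_u(x) = <(cos u, sin u), x> - 1 (so that f(x) = ||x|| - 1)
# is a made-up example, not taken from the paper.
def inexact_oracle(x, n_samples=100, seed=1):
    rng = np.random.default_rng(seed)
    us = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)   # finite sample of U
    a = np.stack([np.cos(us), np.sin(us)], axis=1)        # slopes of the pieces
    vals = a @ x - 1.0
    j = int(np.argmax(vals))
    return float(vals[j]), a[j]   # lower estimate of f(x), subgradient of piece j

x = np.array([2.0, 1.0])
f_low, g = inexact_oracle(x)
print(f_low, np.linalg.norm(x) - 1.0)   # the estimate never exceeds the true value
```

The value returned is always a lower estimate of f(x), and the returned vector is an exact subgradient of the selected piece, hence an approximate subgradient of f.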
Based on the above observation, in this paper we focus particularly on the case where the function f is general, possibly nonsmooth, and its exact function values and subgradients may be difficult to obtain, whereas the function h is assumed to be relatively simple. Our main goal is to provide an efficient method, namely the improved alternating linearization bundle method, for this kind of structured convex optimization problem. The basic tool we use here to handle nonsmoothness and inexactness is the bundle technique, since in the nonsmooth optimization community, bundle methods [26,27,28,29] are regarded as the most robust and reliable methods, whose variants have been well studied for handling inexact oracles [23,25,30,31,32,33].
Roughly speaking, our method generalizes the alternating linearization bundle method of Kiwiel [22] from exact and inexact oracles to various oracles, including exact, inexact, partially inexact, asymptotically exact and partially asymptotically exact oracles. These oracles are covered by the so-called on-demand accuracy oracles proposed by de Oliveira and Sagastizábal [23], whose accuracy is controlled by two parameters: a descent target and an error bound. More precisely, the accuracy is bounded by the error bound whenever the function estimate reaches a certain descent target. The main advantage of oracles with on-demand accuracy is that the function and subgradient estimates can be rough, without any accuracy control, at some "non-critical" iterates, so computational effort can be saved.
At each iteration, the proposed algorithm alternately solves two interrelated subproblems: one finds the proximal point of the polyhedral model of f plus the linearization of h; the other finds the proximal point of the linearization of f plus h. We establish global convergence of the algorithm under different types of inexactness. Finally, some preliminary numerical results on a set of two-stage stochastic linear programming problems show that our method is encouraging.
This paper is organized as follows. In Section 2, we recall the condition of the inexact first-order oracles. In Section 3, we present an improved alternating linearization bundle method for structured convex optimization with inexact first-order oracles and show some properties of the algorithm. In Section 4, we establish global convergence of the algorithm under different types of inexactness. In Section 5, we provide some numerical experiments on two-stage stochastic linear programming problems. The notation is standard: the Euclidean inner product in $\mathbb{R}^n$ is denoted by $\langle\cdot,\cdot\rangle$, and the associated norm by $\|\cdot\|$.
2. Preliminaries
For a given constant $\varepsilon\ge 0$, the $\varepsilon$-subdifferential of a convex function f at x is defined by (see Reference [34])
$$\partial_\varepsilon f(x) := \{\, g\in\mathbb{R}^n : f(y) \ge f(x) + \langle g, y-x\rangle - \varepsilon, \ \forall\, y\in\mathbb{R}^n \,\},$$
with $\partial f(x) := \partial_0 f(x)$ being the usual subdifferential in convex analysis [35]. Each element $g\in\partial_\varepsilon f(x)$ is called an $\varepsilon$-subgradient. For simplicity, we use the following notations:
$f_x$: the approximate f value at x, that is, an estimate of $f(x)$;
$g_x$: an approximate subgradient of f at x;
$F_x$: the approximate F value at x, that is, $F_x := f_x + h(x)$.
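As a simple illustration of this definition (an example of ours, not tied to problem (1)), take $f(x)=|x|$ on the real line and the point $x=1$; checking the defining inequality $|y| \ge 1 + g(y-1) - \varepsilon$ for all y directly gives
$$\partial_\varepsilon |\cdot|\,(1) \;=\; [\,1-\varepsilon,\; 1\,], \qquad 0\le\varepsilon\le 2,$$
so an $\varepsilon$-subgradient may violate the exact subgradient inequality by at most $\varepsilon$, which is exactly the kind of slack the inexact oracles below exploit.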
Aiming at the special structure of problem (1), we present a slight variant of the oracles with on-demand accuracy proposed in Reference [23] as follows: for a given point x, a descent target and an error bound, the approximate values $f_x$, $g_x$ and $F_x$ returned by the oracle satisfy condition (6). From the relations in (6), we see that although the error of the estimate is unknown, it has to be restricted within the error bound whenever the descent target is reached. This ensures that the exact and inexact function values satisfy (7).
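In code, such an oracle can be organized schematically as follows; here rough_eval and refine are hypothetical user-supplied routines, the convention that "reaching the descent target" means the rough estimate falls below the target is an assumption of this sketch, and the sketch is not a literal transcription of condition (6):

```python
# Schematic oracle with on-demand accuracy: compute cheap estimates first and
# refine them only when the estimate reaches the descent target, so that the
# final error stays within the prescribed error bound.  `rough_eval(x)` and
# `refine(x, error_bound)` are hypothetical user-supplied routines.
def on_demand_oracle(x, descent_target, error_bound, rough_eval, refine):
    fx, gx = rough_eval(x)                # cheap estimate, uncontrolled accuracy
    if fx <= descent_target:              # target reached: accuracy now matters
        fx, gx = refine(x, error_bound)   # recompute with error <= error_bound
    return fx, gx                         # estimate of f(x) and an approximate subgradient
```

For "non-critical" points whose estimates stay above the target, the expensive refinement is skipped entirely.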
The advantages of oracle (6) are the following: (1) if the descent target is not reached, the calculation of the oracle information can be rough, without any accuracy control, which can potentially reduce the computational cost; (2) by properly choosing the two parameters (the descent target and the error bound), the oracle (6) covers various existing oracles:
Exact (Ex) [12,21]: the error bound is zero and the descent target is always reached, so exact values are returned at every point;
Partially Inexact (PI) [24]: exact values are required only when the descent target is reached;
Inexact (IE) [11,25,32,36,37]: a fixed error bound (possibly unknown) is imposed at every point;
Asymptotically Exact (AE) [38,39]: the error bounds are driven to zero along the iterative process;
Partially Asymptotically Exact (PAE) [23]: the error bounds are driven to zero, and are enforced only when the descent target is reached.
3. The Generalized Alternating Linearization Bundle Method
In this section, we present our generalized alternating linearization bundle method with inexact first-order oracles for solving (1).
Let k be the current iteration index, let $x^j$, $j\in J_k$, be the points generated by previous iterations, and let the corresponding approximate values $f_{x^j}$ and $g_{x^j}$ be produced by the oracle (6). For notational convenience, we denote $f_j := f_{x^j}$ and $g_j := g_{x^j}$. The approximate linearizations of f at $x^j$ are given by
$$\bar f_j(\cdot) := f_j + \langle g_j, \cdot - x^j\rangle, \qquad j\in J_k.$$
From the second relation in (6), we have $\bar f_j(\cdot)\le f(\cdot)$, which implies that $\bar f_j$ is a lower approximation to f. Next, it is natural to define the polyhedral inexact cutting-plane model of f by
$$\check f_k(\cdot) := \max_{j\in J_k} \bar f_j(\cdot), \qquad\qquad (8)$$
which is obviously a lower polyhedral model for f, that is, $\check f_k(\cdot)\le f(\cdot)$.
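A direct way to evaluate this model in code is to keep the bundle as a list of triples (point, approximate value, approximate subgradient) and take the pointwise maximum of the cuts; the tiny example below (with an exactly evaluated quadratic, purely for illustration) confirms the lower-bound property:

```python
import numpy as np

# Evaluate the polyhedral cutting-plane model check_f(x) = max_j { f_j + <g_j, x - x_j> }
# from a bundle of (point, value, subgradient) triples, as in (8); the concrete
# container used here is only an illustration.
def cutting_plane_model(x, bundle):
    """bundle: list of (x_j, f_j, g_j) produced by the oracle."""
    return max(f_j + g_j @ (x - x_j) for (x_j, f_j, g_j) in bundle)

# Tiny usage example with f(x) = 0.5*||x||^2 evaluated exactly at two points.
pts = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
bundle = [(p, 0.5 * p @ p, p) for p in pts]          # exact value and gradient of f
x = np.array([0.5, 0.5])
print(cutting_plane_model(x, bundle), 0.5 * x @ x)   # model value <= true value
```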
Let $\hat x^k$ (called the stability center) be the "best" point obtained so far, which satisfies $\hat x^k = x^{j(k)}$ for some $j(k)\le k$. Frequently, it holds that $F_{\hat x^k} = \min_{j\le k} F_{x^j}$. Thus, from (7), we have the corresponding accuracy estimate (9) at the stability center.
By applying the bundle idea to the "complex" function f, and keeping the simple function h unchanged, similar to traditional proximal bundle methods (see, e.g., Reference [28]), we may solve the following subproblem to obtain a new iterate:
$$\min_{x\in\mathbb{R}^n} \; \check f_k(x) + h(x) + \frac{1}{2t_k}\|x-\hat x^k\|^2, \qquad\qquad (10)$$
where $t_k>0$ is a proximal parameter. However, subproblem (10) is generally not easy to solve, so by making use of the alternating linearization idea of Kiwiel [22], we solve two easier subproblems instead of (10). These two subproblems are interrelated: one is to find the proximal point of the polyhedral model $\check f_k$ plus a linearization of h, aiming at generating an aggregate linear model of f for use in the second subproblem; the other is to find the proximal point of the aggregate linear model of f plus h, aiming at obtaining a new trial point.
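Schematically, following the alternating linearization scheme of Kiwiel [22] and using the notation introduced above (the symbols for the intermediate and trial points and for the linearizations $\bar h_k$, $\bar f_k$ are ours; the precise statements are subproblems (11) and (13) in Algorithm 1), the two subproblems read
$$z^{k+1} \in \arg\min_{x}\Big\{\, \check f_k(x) + \bar h_k(x) + \tfrac{1}{2 t_k}\|x-\hat x^k\|^2 \,\Big\},
\qquad
y^{k+1} \in \arg\min_{x}\Big\{\, \bar f_k(x) + h(x) + \tfrac{1}{2 t_k}\|x-\hat x^k\|^2 \,\Big\},$$
where the aggregate linearization $\bar f_k$ is built from the optimality condition of the first subproblem, which is what ties the two subproblems together.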
Now, we are ready to present the details of our algorithm, which generalizes the work of Kiwiel [22]. We note that the choice of the model function in the algorithm may be different from the form of (8), since the subgradient aggregation strategy [40] is used to compress the bundle. The algorithm generates three sequences of iterates:
- the sequence of intermediate points, at which the aggregate linear models of f are generated;
- the sequence of trial points;
- the sequence of stability centers.
Algorithm 1 Generalized alternating linearization bundle method
Step 0 (Initialization). Select an initial point, the algorithm constants and an initial stepsize. Call the oracle (6) at the initial point to compute the corresponding approximate values. Choose an initial error bound and a descent target, and initialize the stability center, the initial linearizations and the iteration counter.
Step 1 (Model selection). Choose closed convex model functions satisfying the required lower-approximation conditions.
Step 2 (Solve f-subproblem). Compute the intermediate point by solving subproblem (11), and form the associated aggregate quantities (12).
Step 3 (Solve h-subproblem). Compute the trial point by solving subproblem (13), and form the associated quantities (14).
Step 4 (Stopping criterion). Compute the predicted descent, the aggregate linearization error (15) and the optimality measure. If the stopping test is satisfied, stop.
Step 5 (Noise attenuation). If the test detects that the inexactness is too large, increase the stepsize, update the related quantities, and go back to Step 2.
Step 6 (Call oracle). Select a new error bound and a new descent target. Call the oracle (6) at the trial point to compute the new approximate values.
Step 7 (Descent test). If the descent condition (16) holds, declare a descent step and move the stability center to the trial point; otherwise, declare a null step and keep the stability center unchanged.
Step 8 (Stepsize updating). For a descent step, select the next stepsize. For a null step, either keep the current stepsize or update it when the corresponding condition holds.
Step 9 (Loop). Increase k by one and go to Step 1.
We make some comments about Algorithm 1 as follows.
Remark 1. (i) Theoretically speaking, the model function can take a very simple form, but in order to maintain numerical stability, it may additionally include some active linearizations.
(ii) Alternately solving subproblems (11) and (13) can be regarded as applying the proximal alternating linearization method (e.g., Reference [21]) to the sum of the model function and h.
(iii) If the model function is polyhedral, then subproblem (11) is equivalent to a convex quadratic program and thus can be solved efficiently. In addition, if h is simple, subproblem (13) can also be solved easily, or may even have a closed-form solution.
(iv) The role of Step 5 is to reduce the impact of inexactness. The algorithm loops between Steps 2 and 5, increasing the stepsize, until the test at Step 5 is passed.
(v) The stability center, the descent target and the error bound remain unchanged in the loop between Steps 2 and 5.
(vi) In order to establish global convergence of the algorithm, the descent target and error bound at Step 6 should be suitably updated. Some detailed rules are presented in the next section.
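To illustrate Remark 1 (iii), the sketch below assumes the subproblems take the prox forms sketched after (10): the f-subproblem is written in epigraph form and handed to a general-purpose solver (any QP solver would do), while the h-subproblem is solved in closed form for $h=\mu\|\cdot\|_1$ by soft thresholding. The linearization of h is represented only by its slope h_lin, since the constant does not affect the minimizer; all names and the solver choice are ours.

```python
import numpy as np
from scipy.optimize import minimize

def solve_f_subproblem(bundle, h_lin, x_hat, t):
    """min_x  max_j(f_j + <g_j, x - x_j>) + <h_lin, x> + (1/(2t))*||x - x_hat||^2,
    written in epigraph form with variables (x, r) and solved by SLSQP."""
    n = x_hat.size

    def obj(z):
        x, r = z[:n], z[n]
        return r + h_lin @ x + (x - x_hat) @ (x - x_hat) / (2.0 * t)

    cons = [{"type": "ineq",
             "fun": lambda z, xj=xj, fj=fj, gj=gj:
                 z[n] - (fj + gj @ (z[:n] - xj))}          # r >= each cut
            for (xj, fj, gj) in bundle]
    r0 = max(fj + gj @ (x_hat - xj) for (xj, fj, gj) in bundle)
    res = minimize(obj, np.concatenate([x_hat, [r0]]),
                   constraints=cons, method="SLSQP")
    return res.x[:n]

def solve_h_subproblem(f_lin, x_hat, t, mu):
    """min_x  <f_lin, x> + mu*||x||_1 + (1/(2t))*||x - x_hat||^2
    = soft-thresholding of (x_hat - t*f_lin) at level t*mu."""
    v = x_hat - t * f_lin
    return np.sign(v) * np.maximum(np.abs(v) - t * mu, 0.0)

# Tiny usage example: a one-cut bundle around x_hat = 0 and h = 0.5*||.||_1.
x_hat = np.zeros(2)
bundle = [(x_hat, 0.0, np.array([1.0, -1.0]))]
print(solve_f_subproblem(bundle, h_lin=np.zeros(2), x_hat=x_hat, t=1.0))
print(solve_h_subproblem(f_lin=np.array([1.0, -1.0]), x_hat=x_hat, t=1.0, mu=0.5))
```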
The following lemma summarizes some fundamental properties of Algorithm 1, whose proof is a slight modification of that in Reference [22] (Lemma 2.2).
Lemma 1.
(i) The vectors defined in (12) and (14) satisfy the subgradient relations (17), and the corresponding linearizations satisfy the stated inequalities.
(ii) The aggregate subgradient defined in (15) and the above aggregate linearization can be expressed as stated.
(iii) The predicted descent and the aggregate linearization error of (15) satisfy the stated relations.
(iv) The aggregate linearization can also be expressed in the stated alternative form.
(v) Denote the optimality measure by (22); it satisfies (23) and (24), and the relations (25)–(27) hold. Moreover, if the corresponding condition holds, then (28) follows as well.
Proof. (i) From the optimality condition of subproblem (11), we obtain the stated expression, which implies the first subgradient relation; the associated inequality then follows from the definitions in (12). Similarly, by the optimality condition of (14), we obtain the second expression, which shows the second subgradient relation and the corresponding inequality. Finally, the remaining inequalities in the claim follow by combining these facts.
(ii) Utilizing the linearity of the linearizations, together with (12) and (19), we derive the stated expressions.
(iii) The first relation follows directly from (15); combining (15) and (ii), we obtain the second relation.
(iv) By the definitions above, the aggregate linearization satisfies the stated alternative expression.
(v) Using the Cauchy–Schwarz inequality in the definition (22) gives (23) and (24).
(vi) By (iii), it is easy to get (25). Next, by (18), (20) and (9), we obtain the required bound under the corresponding condition. Relation (26) then follows from the preceding facts, and relation (27) follows from (23) together with the definitions involved. Finally, under the remaining condition the aggregate linearization error is controlled, and therefore (28) holds. □
Remark 2. Relation (17) shows that the vector produced by the f-subproblem is a subgradient of the model function at the intermediate point, and that the vector produced by the h-subproblem is a subgradient of h at the trial point. The quantity defined by (22) can be viewed as an optimality measure of the iterates, which will be proved to converge to zero in the next section. Relation (24) is also a test for optimality, in that the current stability center is an approximately optimal solution to problem (1) whenever the optimality measure is sufficiently small.

4. Global Convergence
This section aims to establish the global convergence of Algorithm 1 for various oracles. These oracles are controlled by two parameters: the error bound and the descent target. In Table 1, we present the choices of these two parameters for the different types of instances described in Section 2, namely the Exact (Ex), Partially Inexact (PI), Inexact (IE), Asymptotically Exact (AE) and Partially Asymptotically Exact (PAE) oracles, where the constants involved are selected as indicated there.
The following lemma is crucial to guarantee the global convergence of Algorithm 1.
Lemma 2. The descent target is always reached at the stability centers, that is, the target inequality in (6) holds at every stability center.
Proof. For instances Ex, IE and AE, the descent target is chosen so that it is always reached, hence the claim holds immediately. For instances PI and PAE, the descent target is finite. For the initial stability center, the claim follows directly from the initialization in Step 0. In addition, for the subsequent stability centers, once the descent test (16) is satisfied at some iteration, the new approximate value reaches the corresponding descent target. □
The following lemma shows that an (approximate) optimal solution can be obtained whenever the algorithm terminates finitely or loops infinitely between Steps 2 and 5.
Lemma 3. If either Algorithm 1 terminates at the kth iteration at Step 4, or loops between Steps 2 and 5 infinitely, then
- (i) the current stability center is an optimal solution to problem (1) for instances Ex and PI;
- (ii) the current stability center is ε-optimal for instance IE;
- (iii) the current stability center is optimal up to the corresponding accuracy for instances AE and PAE.
Proof. First, suppose that Algorithm 1 terminates at Step 4 at iteration k. Then, by the stopping test and (23), the optimality measure vanishes. This together with (24) shows that (29) holds. Thus, from (7), we can conclude that the stability center is optimal for instances Ex and PI, ε-optimal for instance IE, and optimal up to the corresponding accuracy for instances AE and PAE. Second, suppose that Algorithm 1 loops between Steps 2 and 5 infinitely. Then, from Lemma 2 and the condition at Step 5, it follows that (28) holds and the stepsize tends to infinity. Thus, the optimality measure again tends to zero, which along with (24) implies (29), and therefore the claims follow by repeating the corresponding arguments of the first case. □
From the above lemma, in what follows we may assume that Algorithm 1 neither terminates finitely nor loops infinitely between Steps 2 and 5. In addition, as in Reference [22], we assume that the model subgradients in (17) remain bounded whenever the sequence of generated points is bounded.
Algorithm 1 must fall into exactly one of the following two cases:
- (i) the algorithm generates finitely many descent steps;
- (ii) the algorithm generates infinitely many descent steps.
We first consider case (i), in which two subcases may occur, according to whether the stepsize tends to infinity or remains bounded. The first subcase is analyzed in the following lemma.
Lemma 4. Suppose that Algorithm 1 generates finitely many descent steps, that is, there exists an index after which only null steps occur, and that the stepsize tends to infinity. Then the optimality measure tends to zero along these iterations.
Proof. Consider the last time the stepsize increases before Step 5 at such an iteration. The resulting bound, combined with the unboundedness of the stepsize, shows the lemma. □
The following lemma analyzes the second subcase, in which the stepsize remains bounded.
Lemma 5. Suppose that there exists an index after which the stepsize remains fixed and only null steps occur. If the descent criterion (16) fails for all subsequent iterations, then the optimality measure tends to zero.
Proof. In view of the stated assumptions, we know that only null steps occur and the stepsize does not increase at Step 5. By Taylor's expansion, the Cauchy–Schwarz inequality, and the properties of subproblems (11) and (13), we can conclude that the predicted descent tends to zero, so the conclusion holds from (27). For more details, one can refer to Reference [11] (Lemma 3.2). □
By combining Lemmas 4 and 5, we have the following lemma.
Lemma 6. Suppose that there exists an index after which only null steps occur. Then, along the corresponding subsequence determined by the two subcases above, the optimality measure tends to zero.
Now, we can present the main convergence result for the case where the algorithm generates finitely many descent steps.
Theorem 1. Suppose that Algorithm 1 generates finitely many descent steps, and consider the last stability center. Then the last stability center is an optimal solution to problem (1) for instances Ex and PI, an ε-optimal solution for IE, and an approximately optimal solution (up to the corresponding accuracy) for AE and PAE.
Proof. Under the stated assumption, the stability center and its approximate value remain unchanged for all sufficiently large iterations. This, together with (24) and Lemma 6, shows that (29) holds at the last stability center. Hence, similarly to the proof of Lemma 3, we obtain the results of the theorem. □
Next, we consider the second case where the algorithm generates infinitely many descent steps.
Lemma 7. Suppose that Algorithm 1 generates infinitely many descent steps and that the optimal value of (1) is finite. Let K be the index set of descent steps. Then the predicted descents are summable over K and tend to zero, and the aggregate linearization errors tend to zero along K. Moreover, if the sequence of stability centers is bounded, then the optimality measure tends to zero along K.
Proof. From the descent test condition (16), we may first prove that the sum of the predicted descents over K is finite, and therefore they tend to zero, using (26) and the fact that the optimal value is finite. It can further be proved that the aggregate linearization errors tend to zero along K, which follows from the definition of the descent test. Moreover, under the condition that the sequence of stability centers is bounded, the optimality measure tends to zero along K by (v) of Lemma 1. For more details, one can refer to Reference [11] (Lemma 3.4). □
Finally, we present the convergence results for the second case.
Theorem 2. Suppose that Algorithm 1 generates infinitely many descent steps, that the optimal value of (1) is finite, and that the index set K is defined as in Lemma 7. Then
- (i) the approximate optimal values at the stability centers converge, along K, to within ε of the optimal value for instance IE in Table 1;
- (ii) they converge to the optimal value for the remaining instances in Table 1;
- (iii) the optimality measure tends to zero along K, and the corresponding optimality property in (24) holds in the limit.
Proof. (i) For instance IE, the error bound equals ε and the descent target is reached at the stability centers for all indices in K. Then, from (7), the exact and approximate values at the stability centers differ by at most ε, which together with (30) shows part (i).
(ii) Next, the other four instances in Table 1 are considered separately. For instance Ex, the oracle values are exact at every point, which immediately gives the claim. For instance PI, the descent target is reached at the stability centers for all indices in K, so the corresponding oracle values are exact, and the claim follows. For instance AE, the error bounds tend to zero along the iterative process, which together with Lemma 7 shows the claim. For instance PAE, the descent target is reached at the stability centers and the error bounds tend to zero along K; the claim then again follows from Lemma 7. Summarizing the above analysis and noticing (30), we complete the proof of part (ii).
(iii) From Lemma 7, the optimality measure tends to zero along K. This together with (24) shows part (iii). □
5. Numerical Experiments
In this section, we aim to test the numerical efficiency of the proposed algorithm. In the fields of production and transportation, finance and insurance, the power industry, and telecommunications, decision makers usually need to solve problems with uncertain information. As an effective tool for solving such problems, stochastic programming (SP) has attracted more and more attention and research on its practical instances and theories; see, for example, References [41,42]. We consider a class of two-stage SP problems with fixed recourse, whose discretization of the uncertainty into N scenarios has the form (31) (see, e.g., References [23,43]), where x is the first-stage decision variable, constrained to a polyhedral first-stage feasible set with a linear first-stage cost. In addition, the recourse function corresponding to the ith scenario $\xi_i$, which occurs with probability $p_i$, $i=1,\dots,N$, is defined through a second-stage linear program in the second-stage decision variable.
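For concreteness, a standard discretized two-stage stochastic linear program with fixed recourse can be written as follows (the notation $W$, $T_i$, $h_i$, $q_i$, $p_i$ is the generic one used, e.g., in References [23,43], and is introduced here only for illustration):
$$\min_{x \in X}\; c^{\top}x + \sum_{i=1}^{N} p_i\, Q(x,\xi_i),
\qquad
Q(x,\xi_i) \;=\; \min_{y \ge 0}\;\big\{\, q_i^{\top} y \;:\; W y = h_i - T_i x \,\big\},$$
where W is the fixed recourse matrix and y is the second-stage decision for scenario $\xi_i$.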
Clearly, by introducing the indicator function of the first-stage feasible set, problem (31) can be written in the form of (5), and then becomes the form of (1) by setting h to be this indicator function.
The above recourse function can be written in its dual form as a linear program over the corresponding dual feasible set. By solving these linear programs to return solutions with precision up to a given tolerance, one can establish an inexact oracle of the form (6); see Reference [23] for a more detailed description.
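A sketch of such an oracle for the generic form displayed above is given below: each scenario recourse value is obtained from the dual of the second-stage LP, and the dual solution yields a subgradient. The data layout and the use of scipy's linprog (standing in for the MOSEK solver used in the paper, with its feasibility tolerances playing the role of the oracle accuracy) are illustrative assumptions; relatively complete recourse is assumed so that every scenario LP is solvable.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative inexact oracle for the generic two-stage form above:
#   Q(x, xi_i) = max_pi { (h_i - T_i x)^T pi : W^T pi <= q_i }   (dual of the scenario LP),
# and -T_i^T pi_i is a subgradient of Q(., xi_i) at x.  The LP solver's tolerance
# controls the accuracy of the returned values, which is what makes the oracle inexact.
def recourse_oracle(x, c, scenarios, tol=1e-6):
    """scenarios: list of (p_i, q_i, W, h_i, T_i).  Returns (F_x, g_x): an estimate
    of c^T x + sum_i p_i Q(x, xi_i) and an approximate subgradient of it at x."""
    val, g = float(c @ x), np.array(c, dtype=float)
    for (p, q, W, h, T) in scenarios:
        rhs = h - T @ x
        res = linprog(-rhs, A_ub=W.T, b_ub=q, bounds=(None, None), method="highs",
                      options={"primal_feasibility_tolerance": tol,
                               "dual_feasibility_tolerance": tol})
        pi = res.x                     # approximate dual optimal solution
        val += p * float(rhs @ pi)     # scenario contribution to the objective value
        g -= p * (T.T @ pi)            # scenario contribution to the subgradient
    return val, g
```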
Four instances are tested, namely SSN(50), SSN(100), 20-term(50) and 20-term(100), where the integers in brackets denote the number of scenarios N. The SSN instances come from telecommunications and have been studied by Sen, Doverspike, and Cosares [44], while the 20-term instances come from the motor freight carrier's problem and have been studied by Mak, Morton, and Wood [45]. The dimensions of these instances are listed in Table 2.
The algorithm parameters are fixed for all runs, and the maximum bundle size is set to 35. All tests were performed in MATLAB (R2014a) on a PC with an Intel(R) Core(TM) i7-4790 CPU (3.60 GHz) and 4 GB RAM. The quadratic programming and linear programming subproblems were solved by the MOSEK 8 toolbox for MATLAB; see http://www.mosek.com.
We first compare our algorithm (denoted by GALBM) with the accelerated prox-level method (APL) of Reference [43], where the tolerances of the linear programming solver in MOSEK are set to their default values. The results are listed in Table 3, in which the number of iterations (NI), the consumed CPU time in seconds (Time), and the returned minimum values are compared. Note that we use the MATLAB commands tic and toc to measure the consumed CPU time. For each instance, we run 10 times and report the average CPU time. From Table 3, we see that, when similar solution quality is achieved, GALBM significantly outperforms APL in terms of both the number of iterations and the CPU time.
In what follows, we are interested in evaluating the impact of inexact oracles on GALBM. In more detail, we carry out two groups of tests. The first group adopts fixed tolerances for the linear programming solver, and the corresponding results are reported in Table 4, Table 5, Table 6 and Table 7. The second group adopts dynamic tolerances with a safeguard parameter, and the corresponding results are reported in Table 8, Table 9, Table 10 and Table 11. The symbol "-" in the following tables means that the number of iterations for the corresponding instance is greater than 500.
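For the second group, one simple way to realize a dynamic tolerance with a safeguard is to tighten the LP tolerance geometrically while never letting it fall below a safeguard value; the concrete rule below is only an assumed illustration and is not the rule used to produce Tables 8-11:

```python
# Hypothetical dynamic-tolerance schedule with a safeguard: the tolerance is
# tightened geometrically along the iterations but never drops below tol_min.
def dynamic_tolerance(k, tol0=1e-2, kappa=0.5, tol_min=1e-8):
    return max(tol0 * kappa ** k, tol_min)

print([dynamic_tolerance(k) for k in range(6)])
```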