Article

On the Adaptive Penalty Parameter Selection in ADMM

by Serena Crisci 1,†, Valentina De Simone 1,*,† and Marco Viola 2,†
1 Department of Mathematics and Physics, University of Campania “Luigi Vanvitelli”, Viale Abramo Lincoln, 5, 81100 Caserta, Italy
2 School of Mathematics and Statistics, University College Dublin, Belfield, D04 V1W8 Dublin, Ireland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Algorithms 2023, 16(6), 264; https://doi.org/10.3390/a16060264
Submission received: 4 April 2023 / Revised: 17 May 2023 / Accepted: 19 May 2023 / Published: 25 May 2023
(This article belongs to the Special Issue Recent Advances in Nonsmooth Optimization and Analysis)

Abstract

Many data analysis problems can be modeled as constrained optimization problems characterized by nonsmooth functionals, often because of the presence of $\ell_1$-regularization terms. One of the most effective ways to solve such problems is the Alternating Direction Method of Multipliers (ADMM), which has been proved to have good theoretical convergence properties even if the arising subproblems are solved inexactly. Nevertheless, experience shows that the choice of the parameter $\tau$ penalizing the constraint violation in the augmented Lagrangian underlying ADMM affects the method's performance. To this end, strategies for the adaptive selection of such a parameter have been analyzed in the literature and are still of great interest. In this paper, starting from an adaptive spectral strategy recently proposed in the literature, we investigate the use of different strategies based on Barzilai–Borwein-like stepsize rules. We test the effectiveness of the proposed strategies in the solution of real-life consensus logistic regression and portfolio optimization problems.

1. Introduction

The alternating direction method of multipliers [1] (ADMM) has been recognized as a simple but powerful algorithm to solve optimization problems of the form
$$\min_{u \in \mathbb{R}^n,\, v \in \mathbb{R}^m} \; H(u) + G(v) \quad \text{subject to} \quad Eu + Fv = d, \tag{1}$$
where $H:\mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ and $G:\mathbb{R}^m \to \mathbb{R}\cup\{+\infty\}$ are closed, proper, and convex functions, $E \in \mathbb{R}^{p\times n}$, $F \in \mathbb{R}^{p\times m}$, and $d \in \mathbb{R}^{p}$. ADMM splits the problem into smaller pieces, each of which is easier to handle, blending the benefits of dual decomposition and augmented Lagrangian methods [1]. Starting from an initialization $(u^0, v^0, \xi^0)$ and $\tau > 0$, at each iteration ADMM updates the primal and dual variables as
$$u^{k+1} = \operatorname*{argmin}_{u} \; H(u) + \frac{\tau}{2}\left\| d - Eu - Fv^{k} + \frac{\xi^{k}}{\tau} \right\|^2, \tag{2}$$
$$v^{k+1} = \operatorname*{argmin}_{v} \; G(v) + \frac{\tau}{2}\left\| d - Eu^{k+1} - Fv + \frac{\xi^{k}}{\tau} \right\|^2, \tag{3}$$
$$\xi^{k+1} = \xi^{k} + \tau\left( d - Eu^{k+1} - Fv^{k+1} \right). \tag{4}$$
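As an illustration of the scheme (2)–(4), the following minimal Python sketch applies it to a toy instance of (1) with $H(u) = \frac{1}{2}\|Au - b\|^2$, $G(v) = \lambda\|v\|_1$, $E = I$, $F = -I$, and $d = 0$ (a lasso-type splitting with a fixed penalty parameter; the function and variable names are our own illustration, not code from the paper):

```python
import numpy as np

def soft_threshold(w, kappa):
    return np.sign(w) * np.maximum(np.abs(w) - kappa, 0.0)

def admm_lasso(A, b, lam, tau=1.0, max_iter=500, tol=1e-6):
    """ADMM (2)-(4) for min 0.5||Au-b||^2 + lam*||v||_1 s.t. u - v = 0,
    i.e. H(u)=0.5||Au-b||^2, G(v)=lam*||v||_1, E=I, F=-I, d=0."""
    n = A.shape[1]
    u, v, xi = np.zeros(n), np.zeros(n), np.zeros(n)
    M = A.T @ A + tau * np.eye(n)        # matrix of the u-subproblem (fixed tau)
    Atb = A.T @ b
    for _ in range(max_iter):
        u = np.linalg.solve(M, Atb + tau * v + xi)     # update (2)
        v = soft_threshold(u - xi / tau, lam / tau)    # update (3)
        xi = xi + tau * (v - u)                        # update (4): d - Eu - Fv = v - u
        if np.linalg.norm(u - v) <= tol * max(np.linalg.norm(u), 1.0):
            break
    return v
```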
ADMM is guaranteed to converge under mild assumptions for any fixed value of the penalty parameter $\tau$, even if the subproblems are solved inexactly [2]. Despite this, it is well known that the choice of the parameter $\tau$ is problem-dependent and can affect the practical performance of the algorithm, yielding a highly inefficient method when it is not properly selected. Some work has been done to develop suitable techniques for tuning the value of $\tau$ at each iteration, with the aim of speeding up the convergence in practical applications [3,4,5,6]. In [3], the authors proposed an adaptive strategy based on primal and dual residuals, with the idea that $\tau_k$ should force both residuals to have similar magnitudes. Such a scheme is not guaranteed to converge; nevertheless, the standard convergence theory (for fixed values of $\tau$) still applies if one assumes that $\tau_k$ becomes fixed after a finite number of iterations. A more general and reliable approach has been introduced in [7], where the authors proposed a strategy for the automatic selection of the penalty parameter in ADMM, borrowing ideas from spectral stepsize selection strategies in gradient-based methods for unconstrained optimization [8,9,10,11]. In particular, starting from the analysis of the dual unconstrained formulation of problem (1), in [7] an optimal penalty parameter at each iteration is defined as the reciprocal of the geometric mean of Barzilai–Borwein-like spectral stepsizes [8,10], corresponding to a gradient step on the Fenchel conjugates of the functions H and G, respectively.
Relying on the procedure suggested in [7], this paper aims at investigating the practical efficiency of ADMM employing adaptive selections of the penalty parameter based on different spectral stepsize rules. Indeed, spectral analysis of Barzilai–Borwein (BB) rules (and their variants) has shown how different choices can influence the practical acceleration of gradient-based methods for both constrained and unconstrained smooth optimization, due to the intrinsically different abilities of such stepsizes to capture the spectral properties of the problem. For strictly convex quadratic problems, the Barzilai–Borwein updating rules correspond to the inverses of Rayleigh quotients of the Hessian, thus providing suitable approximations of the inverses of its eigenvalues. This ability has been exploited within ad hoc steplength selection strategies to obtain practical accelerations of gradient-based methods; moreover, the property is preserved in the case of general non-quadratic minimization problems [9,11,12,13]. In this view, we combine the adaptive ADMM scheme with state-of-the-art stepsize rules to compute reliable approximations of the penalty parameter $\tau_k$ at each iteration. The resulting variants of the ADMM scheme are compared on two real-life applications in the frameworks of binary classification on distributed architectures and portfolio selection.
The paper is organized as follows. The adaptive ADMM algorithm is described in Section 2. Numerical experiments are reported in Section 3. Finally, some conclusions are drawn in Section 4.

2. Adaptive Penalty Parameter Selection in the ADMM Method

In this section, we describe the strategy for automatic selection of the penalty parameter in ADMM according to the procedure proposed in [7], in which the authors introduced an adaptive selection of τ based on the spectral properties of the Douglas–Rachford (DR) splitting method applied to the dual problem of (1).
Given a closed convex (proper) function f defined on $\mathbb{R}^n$, the Fenchel conjugate of f is the closed convex (proper) function $f^*$ defined by
$$f^*(x^*) = \sup_{x}\left\{ \langle x, x^*\rangle - f(x) \right\} = -\inf_{x}\left\{ f(x) - \langle x, x^*\rangle \right\}$$
(see [14]).
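As a standard example (not stated in the paper, but relevant to the $\ell_1$ terms used in the applications of Section 3), the conjugate of $f(x) = \lambda\|x\|_1$ can be computed in closed form:
$$f^*(x^*) = \sup_{x}\left\{ \langle x, x^*\rangle - \lambda\|x\|_1 \right\} = \begin{cases} 0, & \text{if } \|x^*\|_\infty \le \lambda,\\ +\infty, & \text{otherwise}, \end{cases}$$
i.e., the indicator function of the $\ell_\infty$-ball of radius $\lambda$.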
The dual problem of (1) is given by
$$\max_{\xi\in\mathbb{R}^p} \; \xi^\top d - H^*(E^\top\xi) - G^*(F^\top\xi), \tag{5}$$
where $H^*$ and $G^*$ denote the Fenchel conjugates of H and G, respectively. Problem (5) can be equivalently rewritten as
$$\min_{\xi\in\mathbb{R}^p} \; H^*(E^\top\xi) - \xi^\top d + G^*(F^\top\xi). \tag{6}$$
It can be proved that solving (1) by ADMM is equivalent to solving the dual problem (6) by means of the DR method [15], which, in turn, is equivalent to applying the DR scheme to
$$0 \in \left( E\,\partial H^*(E^\top\xi) - d \right) + F\,\partial G^*(F^\top\xi). \tag{7}$$
In this way, two sequences $(\bar{\xi}^k)_k$ and $(\xi^k)_k$ are generated such that
$$0 \in \frac{\bar{\xi}^{k+1} - \xi^{k}}{\tau_k} + \left( E\,\partial H^*(E^\top\bar{\xi}^{k+1}) - d \right) + F\,\partial G^*(F^\top\xi^{k}),$$
$$0 \in \frac{\xi^{k+1} - \xi^{k}}{\tau_k} + \left( E\,\partial H^*(E^\top\bar{\xi}^{k+1}) - d \right) + F\,\partial G^*(F^\top\xi^{k+1}). \tag{8}$$
Then, Proposition 1 in [7] proves that the choice of the parameter $\tau_k$ guaranteeing the minimal residual of $\left( E\,\partial H^*(E^\top\xi^{k+1}) - d \right) + F\,\partial G^*(F^\top\xi^{k+1})$ in the DR steps is given by
$$\tau_k = \sqrt{\alpha_k\,\beta_k}, \tag{9}$$
where $\alpha_k, \beta_k > 0$ are BB stepsizes obtained by imposing the following quasi-Newton conditions:
$$\alpha_k = \operatorname*{argmin}_{\alpha\in\mathbb{R}} \left\| \alpha^{-1}\left(\bar{\xi}^{k} - \bar{\xi}^{k-1}\right) - \left(\phi^{k} - \phi^{k-1}\right) \right\|, \tag{10}$$
$$\beta_k = \operatorname*{argmin}_{\beta\in\mathbb{R}} \left\| \beta^{-1}\left(\xi^{k} - \xi^{k-1}\right) - \left(\psi^{k} - \psi^{k-1}\right) \right\|, \tag{11}$$
where $\phi^{k} \in E\,\partial H^*(E^\top\bar{\xi}^{k}) - d$, $\phi^{k-1} \in E\,\partial H^*(E^\top\bar{\xi}^{k-1}) - d$, $\psi^{k} \in F\,\partial G^*(F^\top\xi^{k})$, and $\psi^{k-1} \in F\,\partial G^*(F^\top\xi^{k-1})$. Note that $1/\alpha_k$ and $1/\beta_k$ can be interpreted as spectral gradient stepsizes of type BB1 for $H^*(E^\top\bar{\xi}^{k}) - (\bar{\xi}^{k})^\top d$ and $G^*(F^\top\xi^{k})$, respectively.
Based on the equivalence between DR and ADMM, the optimal DR stepsize $\tau_k$ defined in (9) corresponds to the optimal penalty parameter for the ADMM scheme. Moreover, to compute practical estimates of these optimal parameters for ADMM, the dual problem does not need to be formed explicitly, thanks to the theoretical link between primal and dual variables. Indeed, the optimality condition for Problem (2) prescribes
$$0 \in \partial H(u^{k+1}) - E^\top\left( \xi^{k} + \tau_k\left( d - Eu^{k+1} - Fv^{k}\right) \right),$$
which is equivalent to
$$E^\top\left( \xi^{k} + \tau_k\left( d - Eu^{k+1} - Fv^{k}\right) \right) \in \partial H(u^{k+1}).$$
Recalling that for a closed proper convex function f, $x \in \partial f^*(x^*)$ if and only if $x^* \in \partial f(x)$ (see [14], Corollary 23.5.1), from the previous relation we obtain
$$u^{k+1} \in \partial H^*\!\left( E^\top\left( \xi^{k} + \tau_k\left( d - Eu^{k+1} - Fv^{k}\right) \right) \right),$$
and, hence, it follows
$$Eu^{k+1} - d \in E\,\partial H^*\!\left( E^\top\bar{\xi}^{k+1} \right) - d, \tag{12}$$
where $\bar{\xi}^{k+1} := \xi^{k} + \tau_k\left( d - Eu^{k+1} - Fv^{k}\right)$. Similarly, from the optimality condition for subproblem (3), one can obtain
$$Fv^{k+1} \in F\,\partial G^*\!\left( F^\top\xi^{k+1} \right). \tag{13}$$
From (12) and (13), we have
$$\bar{\xi}^{k+1} - \xi^{k} \in -\tau_k\left[ \left( E\,\partial H^*(E^\top\bar{\xi}^{k+1}) - d \right) + F\,\partial G^*(F^\top\xi^{k}) \right],$$
$$\xi^{k+1} - \xi^{k} \in -\tau_k\left[ \left( E\,\partial H^*(E^\top\bar{\xi}^{k+1}) - d \right) + F\,\partial G^*(F^\top\xi^{k+1}) \right].$$
Finally, one can define $\Delta\bar{\xi}^{k-1} = \bar{\xi}^{k} - \bar{\xi}^{k-1}$ and $\Delta\bar{H}^{k-1} = \partial\bar{H}(\bar{\xi}^{k}) - \partial\bar{H}(\bar{\xi}^{k-1})$, where $\bar{H}(\xi) := H^*(E^\top\xi) - \xi^\top d$ and the set subtraction is given by the Minkowski–Pontryagin difference [16,17]. In particular, a practical computation of an element of $\Delta\bar{H}^{k-1}$ can be provided through quantities available at the current ADMM iteration by exploiting (12); thus, with a slight abuse of notation, we may write
$$\Delta\bar{H}^{k-1} = E\left( u^{k} - u^{k-1} \right).$$
Then, the two BB-based rules can be recovered as
$$\alpha_k^{BB1} = \operatorname*{argmin}_{\alpha\in\mathbb{R}} \left\| \alpha^{-1}\,\Delta\bar{\xi}^{k-1} - \Delta\bar{H}^{k-1} \right\| = \frac{\|\Delta\bar{\xi}^{k-1}\|^2}{(\Delta\bar{\xi}^{k-1})^\top \Delta\bar{H}^{k-1}}, \tag{16}$$
$$\alpha_k^{BB2} = \operatorname*{argmin}_{\alpha\in\mathbb{R}} \left\| \Delta\bar{\xi}^{k-1} - \alpha\,\Delta\bar{H}^{k-1} \right\| = \frac{(\Delta\bar{\xi}^{k-1})^\top \Delta\bar{H}^{k-1}}{\|\Delta\bar{H}^{k-1}\|^2}. \tag{17}$$
With a similar argument, the curvature estimates of $\bar{G}(\xi) := G^*(F^\top\xi)$ are provided by the following stepsizes:
$$\beta_k^{BB1} = \operatorname*{argmin}_{\beta\in\mathbb{R}} \left\| \beta^{-1}\,\Delta\xi^{k-1} - \Delta\bar{G}^{k-1} \right\| = \frac{\|\Delta\xi^{k-1}\|^2}{(\Delta\xi^{k-1})^\top \Delta\bar{G}^{k-1}}, \tag{18}$$
$$\beta_k^{BB2} = \operatorname*{argmin}_{\beta\in\mathbb{R}} \left\| \Delta\xi^{k-1} - \beta\,\Delta\bar{G}^{k-1} \right\| = \frac{(\Delta\xi^{k-1})^\top \Delta\bar{G}^{k-1}}{\|\Delta\bar{G}^{k-1}\|^2}, \tag{19}$$
where $\Delta\xi^{k-1} = \xi^{k} - \xi^{k-1}$ and $\Delta\bar{G}^{k-1} = F\left( v^{k} - v^{k-1} \right)$. The previous quasi-Newton conditions express a local linearity property of the dual subgradients with respect to the dual variables. The validity of this assumption can be checked during the iterative procedure to test the reliability of the spectral BB-based parameters. In particular, the stepsizes (16)–(19) can be considered reliable when, respectively, the ratios
$$\alpha_{\mathrm{cor}} = \frac{(\Delta\bar{\xi}^{k-1})^\top \Delta\bar{H}^{k-1}}{\|\Delta\bar{\xi}^{k-1}\|\,\|\Delta\bar{H}^{k-1}\|} \quad\text{and}\quad \beta_{\mathrm{cor}} = \frac{(\Delta\xi^{k-1})^\top \Delta\bar{G}^{k-1}}{\|\Delta\xi^{k-1}\|\,\|\Delta\bar{G}^{k-1}\|} \tag{20}$$
are bounded away from zero.
Then, as a safeguarding condition, the penalty parameter is updated according to (9) when both ratios in (20) are greater than a prefixed threshold $\bar{\epsilon}\in(0,1)$, which expresses the required level of reliability of the estimates (10) and (11). If only one of the ratios satisfies the safeguarding condition, the corresponding stepsize is used to estimate $\tau_k$; when both stepsizes are considered inaccurate, i.e., when $\alpha_{\mathrm{cor}}\le\bar{\epsilon}$ and $\beta_{\mathrm{cor}}\le\bar{\epsilon}$, the parameter is kept equal to the last updated value.
A general ADMM scheme with adaptive selection of the penalty parameter based on the described procedure is outlined in Algorithm 1. We remark that different versions of Algorithm 1 arise depending on the rules selected for computing the spectral stepsizes $\alpha_k$ and $\beta_k$ in STEP 4. In particular, in the scheme originally proposed in [7], $\alpha_k$ and $\beta_k$ are provided by a generalization of the adaptive steepest descent (ASD) strategy introduced in [10], which performs a proper alternation of larger and smaller stepsizes. In the next section, we will compare this procedure with other updating rules based on both single and alternating BB-based strategies.
As a final remark, we recall that a proof of the convergence of ADMM with variable penalty parameter was provided in [3] under suitable assumptions on the increase and decrease of the sequence $\{\tau_k\}_k$ (see Theorem 4.1 in Section 4 of [3]). Although this convergence analysis cannot be straightforwardly applied to our case, as observed in [7], this issue may be bypassed in practice by turning off the adaptivity after a finite number of steps.
Algorithm 1 A general scheme for ADMM with adaptive penalty parameter selection
Initialize $u^0$, $v^0$, $\xi^0$, $\tau_0 > 0$, $\bar{\epsilon}\in(0,1)$, $\bar{n}\ge 1$
For $k = 0, 1, 2, \ldots$
STEP 1 Compute $u^{k+1}$ by solving (2).
STEP 2 Compute $v^{k+1}$ by solving (3).
STEP 3 Update $\xi^{k+1}$ by means of (4).
STEP 4 If $\mathrm{mod}(k, \bar{n}) = 1$ then
        $\bar{\xi}^{k+1} := \xi^{k} + \tau_k\left( d - Eu^{k+1} - Fv^{k}\right)$
        Compute spectral stepsizes $\alpha_k$, $\beta_k$ according to (10) and (11)
        Compute correlations $\alpha_{\mathrm{cor}}$, $\beta_{\mathrm{cor}}$
        if $\alpha_{\mathrm{cor}} > \bar{\epsilon}$ and $\beta_{\mathrm{cor}} > \bar{\epsilon}$ then
            $\tau_{k+1} = \sqrt{\alpha_k\,\beta_k}$
        elseif $\alpha_{\mathrm{cor}} > \bar{\epsilon}$ and $\beta_{\mathrm{cor}} \le \bar{\epsilon}$ then
            $\tau_{k+1} = \alpha_k$
        elseif $\alpha_{\mathrm{cor}} \le \bar{\epsilon}$ and $\beta_{\mathrm{cor}} > \bar{\epsilon}$ then
            $\tau_{k+1} = \beta_k$
        else
            $\tau_{k+1} = \tau_k$
        endif
    else
        $\tau_{k+1} = \tau_k$
    endif
STEP 5 end for
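To make STEP 4 concrete, the following is a minimal Python sketch of the safeguarded spectral update (the paper's experiments were run in MATLAB; the function name, arguments, and the choice of the plain BB2 rules here are our own illustration, since STEP 4 admits any of the rules discussed in Section 3):

```python
import numpy as np

def spectral_tau_update(tau, xi_bar, xi_bar_old, xi, xi_old, Du, Dv, eps_bar=0.2):
    """Safeguarded spectral update of the ADMM penalty parameter (STEP 4).

    Du = E @ (u_k - u_km1) plays the role of an element of Delta H_bar,
    Dv = F @ (v_k - v_km1) plays the role of an element of Delta G_bar.
    """
    d_xi_bar = xi_bar - xi_bar_old                      # Delta xi_bar
    d_xi = xi - xi_old                                  # Delta xi

    # BB-like curvature estimates (16)-(19); here only the BB2-type ones are kept
    alpha = (d_xi_bar @ Du) / (Du @ Du)                 # alpha_k^{BB2}
    beta = (d_xi @ Dv) / (Dv @ Dv)                      # beta_k^{BB2}

    # correlation ratios (20) measuring the reliability of the estimates
    a_cor = (d_xi_bar @ Du) / (np.linalg.norm(d_xi_bar) * np.linalg.norm(Du))
    b_cor = (d_xi @ Dv) / (np.linalg.norm(d_xi) * np.linalg.norm(Dv))

    if a_cor > eps_bar and b_cor > eps_bar:
        return np.sqrt(alpha * beta)                    # rule (9)
    elif a_cor > eps_bar:
        return alpha
    elif b_cor > eps_bar:
        return beta
    return tau                                          # keep the previous value
```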

3. Numerical Experiments

In this section, we present the results of the numerical experiments performed to assess the performance of Algorithm 1 equipped with different choices for the update of $\tau_k$ in STEP 4. We compare the alternating strategy used in [7] with the BB1-like rules (16) and (18), the BB2-like rules (17) and (19), and a different alternating strategy, based on a modified version of the $ABB_{\min}$ rule [18] introduced in [19]. The five algorithms compared in this section are the following:
• “Vanilla ADMM”, in which $\tau_k$ is fixed to $\tau_0$ throughout all iterations;
• “Adaptive ADMM” [7], in which $\alpha_k$ and $\beta_k$ are set to
$$\alpha_k = \begin{cases} \alpha_k^{BB2}, & \text{if } 2\,\alpha_k^{BB2} > \alpha_k^{BB1},\\[2pt] \alpha_k^{BB1} - \dfrac{\alpha_k^{BB2}}{2}, & \text{otherwise}, \end{cases} \qquad \beta_k = \begin{cases} \beta_k^{BB2}, & \text{if } 2\,\beta_k^{BB2} > \beta_k^{BB1},\\[2pt] \beta_k^{BB1} - \dfrac{\beta_k^{BB2}}{2}, & \text{otherwise}; \end{cases}$$
• “Adaptive ADMM-BB1”, in which $\alpha_k$ and $\beta_k$ are set to $\alpha_k^{BB1}$ and $\beta_k^{BB1}$, respectively;
• “Adaptive ADMM-BB2”, in which $\alpha_k$ and $\beta_k$ are set to $\alpha_k^{BB2}$ and $\beta_k^{BB2}$, respectively;
• “Adaptive ADMM-$ABB_{\min}$”, in which $\alpha_k$ and $\beta_k$ are set to the following alternating rules (a short code sketch of these rules is given after this list):
$$\alpha_k^{ABB_{\min}} = \begin{cases} \min\left\{ \alpha_j^{BB2} : j = \max\{1, k - m_\alpha\}, \ldots, k \right\}, & \text{if } \alpha_k^{BB2} < \delta_k\, \alpha_k^{BB1},\\[2pt] \alpha_k^{BB1}, & \text{otherwise}, \end{cases}$$
$$\beta_k^{ABB_{\min}} = \begin{cases} \min\left\{ \beta_j^{BB2} : j = \max\{1, k - m_\alpha\}, \ldots, k \right\}, & \text{if } \beta_k^{BB2} < \delta_k\, \beta_k^{BB1},\\[2pt] \beta_k^{BB1}, & \text{otherwise}, \end{cases}$$
where $m_\alpha = 2$, $\delta_0 = 0.5$, and $\delta_k$ is updated as follows:
$$\delta_{k+1} = \begin{cases} \delta_k / 1.2, & \text{if } \alpha_k^{BB2} < \delta_k\,\alpha_k^{BB1} \ \text{or}\ \beta_k^{BB2} < \delta_k\,\beta_k^{BB1},\\[2pt] \delta_k \cdot 1.2, & \text{otherwise}. \end{cases}$$
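The following Python fragment sketches the two alternating selection rules above, i.e., the hybrid rule of [7] and the $ABB_{\min}$-like rule (our own illustration; the histories of the BB2 values and the handling of $\delta_k$ follow the formulas above):

```python
def hybrid_stepsize(bb1, bb2):
    """Alternating rule used by 'Adaptive ADMM' [7]."""
    return bb2 if 2.0 * bb2 > bb1 else bb1 - bb2 / 2.0

def abbmin_stepsizes(a_bb1, a_bb2_hist, b_bb1, b_bb2_hist, delta, m_alpha=2):
    """ABB_min-like rule for alpha_k and beta_k, with the shared update of
    the switching threshold delta_k (delta_0 = 0.5). The *_hist lists
    collect the BB2 values computed up to the current iteration."""
    a_bb2, b_bb2 = a_bb2_hist[-1], b_bb2_hist[-1]
    switch_a = a_bb2 < delta * a_bb1
    switch_b = b_bb2 < delta * b_bb1
    alpha = min(a_bb2_hist[-(m_alpha + 1):]) if switch_a else a_bb1
    beta = min(b_bb2_hist[-(m_alpha + 1):]) if switch_b else b_bb1
    delta = delta / 1.2 if (switch_a or switch_b) else delta * 1.2
    return alpha, beta, delta
```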
For all the algorithms, following [7], we set $\bar{\epsilon} = 0.2$ and $\bar{n} = 2$. The methods stop when the relative residual falls below a prefixed tolerance $tol > 0$ within a maximum number of iterations, where the relative residual is defined by
$$\max\left\{ \frac{\|r_p^k\|_2}{\max\left\{\|Eu^k\|_2,\ \|Fv^k\|_2,\ \|d\|_2\right\}}, \; \frac{\|r_d^k\|_2}{\|E^\top\xi^k\|_2} \right\}, \tag{21}$$
with $r_p^k = d - Eu^k - Fv^k$ and $r_d^k = \tau_k\, E^\top F\left( v^k - v^{k-1}\right)$. All experiments were performed in MATLAB. Recently, the PADPD algorithm [20] has been proposed as an ADMM-like method, but it uses a fixed stepsize and its comparison is beyond the scope of this paper.
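A direct translation of the stopping criterion (21) into Python (our own illustration; the actual experiments were run in MATLAB):

```python
import numpy as np

def relative_residual(E, F, d, u, v, v_old, xi, tau):
    """Relative residual used as stopping criterion, following (21)."""
    r_p = d - E @ u - F @ v                         # primal residual
    r_d = tau * (E.T @ (F @ (v - v_old)))           # dual residual
    p_scale = max(np.linalg.norm(E @ u), np.linalg.norm(F @ v), np.linalg.norm(d))
    return max(np.linalg.norm(r_p) / p_scale,
               np.linalg.norm(r_d) / np.linalg.norm(E.T @ xi))
```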

3.1. Consensus $\ell_1$-Regularized Logistic Regression

As a first experiment, we test the proposed algorithms on the solution of consensus $\ell_1$-regularized logistic regression (see Section 4.2 in [1,7]). Let us consider a dataset consisting of M training pairs $(D_i, y_i) \in \mathbb{R}^n \times \{0,1\}$. The aim is to build a linear classifier by minimizing a regularized logistic regression functional, exploiting a distributed computing architecture. One can do so by partitioning the original dataset into S subsets of size $m_1, \ldots, m_S$, such that $\sum_{s=1}^{S} m_s = M$, and solving the optimization problem
$$\min_{x_1,\ldots,x_S \in \mathbb{R}^n,\ z \in \mathbb{R}^n} \; \sum_{s=1}^{S} \sum_{j=1}^{m_s} \log\left( 1 + e^{- y_{s,j}\, D_{s,j}\, x_s} \right) + \lambda \|z\|_1 \quad \text{s.t.} \quad x_s - z = 0, \ \ s = 1,\ldots,S, \tag{22}$$
where $x_s$ is the local variable on the s-th computational node, acting as a linear classifier for the s-th subset, $(D_{s,j}, y_{s,j})$ is the j-th training pair of the s-th subset, and z is the global (consensus) variable.
We can reformulate problem (22) as
$$\min_{u, v} \; H(u) + G(v) \quad \text{s.t.} \quad u - Fv = 0, \tag{23}$$
where we set $u = (x_1, \ldots, x_S) \in \mathbb{R}^{nS}$, $v = z$, $H(u) = \sum_{s=1}^{S} \sum_{j=1}^{m_s} \log\left( 1 + e^{- y_{s,j}\, D_{s,j}\, x_s} \right)$, $G(v) = \lambda\|v\|_1$, and $F = (I_n, \ldots, I_n)$ (i.e., the stacking of S identity matrices of order n). Scaling the dual variable, the augmented Lagrangian function associated with problem (23) is
$$\mathcal{L}_A = H(u) + G(v) + \frac{\tau}{2}\left\| u - Fv - \frac{\xi}{\tau} \right\|^2 - \frac{\tau}{2}\left\| \frac{\xi}{\tau} \right\|^2,$$
where $\xi = (\xi_1, \ldots, \xi_S) \in \mathbb{R}^{nS}$ and $\tau > 0$. Starting from given estimates $u^0$, $v^0$, $\xi^0$, and $\tau_0$, and observing that the minimization problem in u can be split into S independent optimization problems in $x_1, \ldots, x_S$, at each iteration k the adaptive ADMM updates the estimates as
$$x_s^{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; \sum_{j=1}^{m_s} \log\left( 1 + e^{- y_{s,j}\, D_{s,j}\, x} \right) + \frac{\tau_k}{2}\left\| x - v^{k} - \frac{\xi_s^{k}}{\tau_k} \right\|^2, \qquad s = 1, \ldots, S, \tag{24}$$
$$v^{k+1} = \operatorname*{argmin}_{v} \; \lambda\|v\|_1 + \frac{\tau_k}{2}\left\| u^{k+1} - Fv - \frac{\xi^{k}}{\tau_k} \right\|^2, \tag{25}$$
$$\xi^{k+1} = \xi^{k} + \tau_k\left( u^{k+1} - Fv^{k+1} \right).$$
The S problems in (24), which are smooth unconstrained optimization problems, are solved approximately via BFGS, with a stopping criterion on the gradient and the objective function whose tolerance is set to 10 times the tolerance given to the ADMM scheme. The minimization in (25) can be performed via the soft-thresholding operator.
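Indeed, since $Fv$ stacks S copies of v, subproblem (25) decouples componentwise and reduces to soft thresholding the average of the shifted local variables; a possible Python sketch (our own illustration, assuming x_new and xi are $S\times n$ arrays collecting the $x_s^{k+1}$ and the $\xi_s^{k}$):

```python
import numpy as np

def soft_threshold(w, kappa):
    return np.sign(w) * np.maximum(np.abs(w) - kappa, 0.0)

def consensus_v_update(x_new, xi, tau, lam):
    """v-update (25): soft-threshold the average of x_s - xi_s / tau."""
    S = x_new.shape[0]
    mean = np.mean(x_new - xi / tau, axis=0)    # average over the S nodes
    return soft_threshold(mean, lam / (tau * S))
```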
We considered 4 datasets from the LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 1 February 2023) dataset collection; to simulate real-life conditions, we created the instances by randomly extracting a training set consisting of 70% of the available data. Table 1 reports the final number of training pairs (M) and the problem dimension (n) for each of the datasets.
We considered $\lambda = 1$ for all problems to enforce sparsity in the solutions and ran all the algorithms with $\tau_0 = 1$, stopping when the relative residual fell below $tol = 10^{-4}$ or the number of iterations reached 500. The results of the tests are reported in Figure 1 (left column) in terms of relative residual vs. number of iterations. We also report in the same figure (right column) the history of $\tau_k$ for each of the five algorithms.
From the pictures, it is clear that the four adaptive strategies are effective in reducing the number of iterations needed for convergence with respect to the “Vanilla ADMM”. Furthermore, the BB2 version outperforms the others in all the considered instances. By looking at the second and fourth rows (problems “cod-rna” and “phishing”), it is interesting to observe that the performance of the adaptive strategies appears to decay as soon as $\tau_k$ is kept fixed for a large number of iterations. This is particularly true for the algorithms “Adaptive ADMM” and “Adaptive ADMM-BB1”.

3.2. Portfolio Optimization

In modern portfolio theory, an optimal portfolio selection strategy has to realize a trade-off between risk and return. Recently, $\ell_1$-regularized Markowitz models have been considered; the $\ell_1$ penalty term is used to stabilize the solution process and to obtain sparse solutions, which allow one to reduce holding costs [21,22,23,24]. We focus on a multi-period investment strategy [23] that is either medium- or long-term; thus, it allows the periodic reallocation of wealth among the assets based on the available information. The investment period is partitioned into m sub-periods, delimited by the rebalancing dates $t_1, \ldots, t_{m+1}$, at which decisions are taken. Let n be the number of assets and $u_j \in \mathbb{R}^n$ the portfolio held at the rebalancing date $t_j$. The optimal portfolio is defined by the vector $u = (u_1, u_2, \ldots, u_m) \in \mathbb{R}^N$, where $N = m \cdot n$. A separable form of the risk measure, obtained by summing single-period terms, is considered:
$$\rho(u) = \sum_{j=1}^{m} u_j^\top C_j\, u_j,$$
where $C_j \in \mathbb{R}^{n\times n}$ is the covariance matrix, assumed to be positive definite, estimated at $t_j$. With this choice, the model satisfies the time-consistency property. $\ell_1$-regularization has been used to promote sparsity in the solution. Moreover, the $\ell_1$ penalty term either avoids or limits negative solutions; thus, it is equivalent to a penalty on short positions.
Let $\xi_{\mathrm{init}}$ and $\xi_{\mathrm{term}}$ be the initial wealth and the target expected wealth resulting from the overall investment, respectively, and let $r_j$ be the expected return vector estimated at time j. The $\ell_1$-regularized selection can be formulated as the following compact constrained optimization problem [23,24]:
$$\min_{u} \; \frac{1}{2}\, u^\top C u + \lambda \|u\|_1 \quad \text{s.t.} \quad Au = b, \tag{27}$$
where $C = \mathrm{diag}(C_1, C_2, \ldots, C_m) \in \mathbb{R}^{N\times N}$ is an $m\times m$ block-diagonal matrix, and A is an $m\times m$ lower block-bidiagonal matrix, with blocks of dimension $1\times n$, defined as
$$\mathrm{diag}(A) = \left( \mathbf{1}_n, \mathbf{1}_n, \ldots, \mathbf{1}_n \right), \qquad \mathrm{subdiag}(A) = \left( -(\mathbf{1}_n + r_1), \ldots, -(\mathbf{1}_n + r_{m-1}) \right),$$
and $b = (\xi_{\mathrm{init}}, 0, 0, \ldots, \xi_{\mathrm{term}}) \in \mathbb{R}^m$. Methods based on Bregman iteration have also proved to be efficient for the solution of Problem (27) [23,24,25,26]. We now reformulate Problem (27) as
$$\min_{u, v} \; H(u) + G(v) \quad \text{s.t.} \quad u - v = 0, \tag{28}$$
where $H(u) = \frac{1}{2}\, u^\top C u$ restricted to the set $\{u \,|\, Au = b\}$ and $G(v) = \lambda\|v\|_1$. Scaling the dual variable, the augmented Lagrangian function associated with Problem (28) is
$$\mathcal{L}_A = H(u) + G(v) + \frac{\tau}{2}\left\| u - v - \frac{\xi}{\tau} \right\|^2 - \frac{\tau}{2}\left\| \frac{\xi}{\tau} \right\|^2,$$
where $\xi \in \mathbb{R}^N$ is the dual variable and $\tau > 0$. Starting from given estimates $u^0$, $v^0$, $\xi^0$, and $\tau_0$, at each iteration k the adaptive ADMM updates the estimates as
$$u^{k+1} = \operatorname*{argmin}_{Au = b} \; \frac{1}{2}\, u^\top C u + \frac{\tau_k}{2}\left\| u - v^{k} - \frac{\xi^{k}}{\tau_k} \right\|^2, \tag{29}$$
$$v^{k+1} = \operatorname*{argmin}_{v} \; \lambda\|v\|_1 + \frac{\tau_k}{2}\left\| u^{k+1} - v - \frac{\xi^{k}}{\tau_k} \right\|^2,$$
$$\xi^{k+1} = \xi^{k} + \tau_k\left( u^{k+1} - v^{k+1} \right).$$
Given $(u^k, v^k, \xi^k)$, the u-update (29) is an equality-constrained problem that involves the solution of the related Karush–Kuhn–Tucker (KKT) system (see [1], Section 4.2.5), which results in a linear system with positive definite coefficient matrix. The minimization with respect to v can be carried out efficiently using the soft-thresholding operator.
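As an illustration of these two updates (not the authors' code, which is in MATLAB), the u-update (29) can be performed by solving the KKT system of the equality-constrained quadratic subproblem and the v-update by soft thresholding; a dense-algebra Python sketch under these assumptions:

```python
import numpy as np

def portfolio_u_update(C, A, b, v, xi, tau):
    """u-update (29): solve the KKT system of
        min 0.5 u'Cu + (tau/2)||u - v - xi/tau||^2  s.t.  A u = b,
    whose stationarity condition reads (C + tau*I) u + A' nu = tau*v + xi."""
    N, p = C.shape[0], A.shape[0]
    K = np.block([[C + tau * np.eye(N), A.T],
                  [A, np.zeros((p, p))]])
    rhs = np.concatenate([tau * v + xi, b])
    sol = np.linalg.solve(K, rhs)
    return sol[:N]                      # discard the multipliers nu

def portfolio_v_update(u_new, xi, tau, lam):
    """v-update: componentwise soft thresholding of u - xi/tau."""
    w = u_new - xi / tau
    return np.sign(w) * np.maximum(np.abs(w) - lam / tau, 0.0)
```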
We test the effectiveness of the methods on real market data. We show results obtained using the following datasets [26]:
• FF48 (Fama & French 48 Industry portfolios): Contains monthly returns of 48 industry portfolios from July 1926 to December 2015. We simulate investment strategies of length 10, 20, and 30 years, with annual rebalancing.
• NASDAQ100 (NASDAQ 100 stock Price Data): Contains monthly returns from November 2004 to April 2016. We simulate an investment strategy of length 10 years, with annual rebalancing.
• Dow Jones (Dow Jones Industrial): Contains monthly returns from February 1990 to April 2016. We simulate an investment strategy of length 10 years, with annual rebalancing.
We set $\lambda = 10^{-2}$ for all the tests to enforce a sparse portfolio. We compared the 5 considered algorithms in terms of the number of iterations needed to reach a tolerance of $10^{-5}$ on the relative residual, letting them run for a maximum of 3000 iterations. Moreover, we considered some financial performance measures expressing the goodness of the optimal portfolios, i.e.,
• density: the percentage of nonzero elements in the solution, which gives an estimate of the holding cost;
  • ratio: it estimates the risk reduction when the naive strategy (at each rebalancing date the total wealth is equally divided among the assets) is taken as the benchmark, defined as
$$\mathrm{ratio} = \frac{u_{\mathrm{naive}}^\top\, C\, u_{\mathrm{naive}}}{(u^*)^\top\, C\, u^*},$$
    where the numerator is the variance of the portfolio produced by the naive strategy and the denominator is the variance of the optimal portfolio.
Table 2, Table 3, Table 4, Table 5 and Table 6 report the results of the experiments performed on the 5 datasets with different choices for the value of $\tau_0$ (namely, 0.1, 0.5, and 1), for a total of 15 instances.
The five algorithms are overall able to obtain equivalent portfolios when converging to the desired tolerance. However, they behave quite differently in terms of the iterations needed to converge. In general, the adaptive strategies allow a reduction in the computational cost with respect to the Vanilla ADMM, which is unable to reach the desired tolerance in 9 out of 15 instances. Among the 4 adaptive strategies, “Adaptive ADMM-BB1” seems to be the most effective, outperforming all the others in 9 out of 15 instances and being the second best in 4 out of 15, performing on average 20% more iterations than the best method in those cases. Unlike what happened in the case of the consensus logistic regression problems, here “Adaptive ADMM-BB2” and “Adaptive ADMM-$ABB_{\min}$” appear to perform poorly, suggesting that the use of too small values of $\tau_k$ may slow down convergence.

4. Conclusions

In this paper, we analyzed different strategies for the adaptive selection of the penalty parameter in ADMM. Exploiting the equivalence between ADMM and the Douglas–Rachford splitting method applied to the dual problem, as suggested in [7], optimal penalty parameters can be estimated at each iteration from spectral stepsizes of a gradient step applied to the Fenchel conjugates of the objective functions. To this end, we selected different spectral steplength strategies based on the Barzilai–Borwein rules, which have been proved to be very efficient in the context of smooth unconstrained optimization.
We compared the different adaptive strategies on the solution of problems coming from distributed machine learning and multi-period portfolio optimization. The results show that, while adaptive versions of ADMM are usually more effective than the “vanilla” one (using a prefixed penalty parameter), different strategies might perform better on different problems. Moreover, in some cases, the proposed alternation rule might get stuck with a fixed penalty parameter, leading to slower convergence. Future work will deal with improved versions of the adaptation strategies, aimed at overcoming the aforementioned issue, and with their analysis on wider problem classes.

Author Contributions

All authors have contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Istituto Nazionale di Alta Matematica-Gruppo Nazionale per il Calcolo Scientifico (INdAM-GNCS) and by the Italian Ministry of University.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122.
2. Eckstein, J.; Bertsekas, D.P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 1992, 55, 293–318.
3. He, B.S.; Yang, H.; Wang, S.L. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. J. Optim. Theory Appl. 2000, 106, 337–356.
4. Ghadimi, E.; Teixeira, A.; Shames, I.; Johansson, M. On the Optimal Step-size Selection for the Alternating Direction Method of Multipliers. IFAC Proc. Vol. 2012, 45, 139–144.
5. Goldstein, T.; Li, M.; Yuan, X. Adaptive primal-dual splitting methods for statistical learning and image processing. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9.
6. Song, C.; Yoon, S.; Pavlovic, V. Fast ADMM algorithm for distributed optimization with adaptive penalty. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
7. Xu, Z.; Figueiredo, M.; Goldstein, T. Adaptive ADMM with spectral penalty parameter selection. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Lauderdale, FL, USA, 20–22 April 2017; pp. 718–727.
8. Barzilai, J.; Borwein, J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988, 8, 141–148.
9. Fletcher, R. On the Barzilai-Borwein method. In Optimization and Control with Applications; Qi, L., Teo, K., Yang, X., Pardalos, P.M., Hearn, D., Eds.; Applied Optimization; Springer: New York, NY, USA, 2005; Volume 96, pp. 235–256.
10. Zhou, B.; Gao, L.; Dai, Y.H. Gradient Methods with Adaptive Step-Sizes. Comput. Optim. Appl. 2006, 35, 69–86.
11. Raydan, M. The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 1997, 7, 26–33.
12. Crisci, S.; Ruggiero, V.; Zanni, L. Steplength selection in gradient projection methods for box-constrained quadratic programs. Appl. Math. Comput. 2019, 356, 312–327.
13. Crisci, S.; Kružík, J.; Pecha, M.; Horák, D. Comparison of active-set and gradient projection-based algorithms for box-constrained quadratic programming. Soft Comput. 2020, 24, 17761–17770.
14. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1997; Volume 11.
15. Esser, E. Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Rep. 2009, 9, 31.
16. Pontryagin, L.S. Linear differential games. SIAM J. Control Optim. 1974, 12, 262–267.
17. Nurminski, E.; Uryasev, S. Yet Another Convex Sets Subtraction with Application in Nondifferentiable Optimization. arXiv 2018, arXiv:1801.06946.
18. Frassoldati, G.; Zanni, L.; Zanghirati, G. New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 2008, 4, 299–312.
19. Bonettini, S.; Zanella, R.; Zanni, L. A scaled gradient projection method for constrained image deblurring. Inverse Probl. 2009, 25.
20. Shaho Alaviani, S.; Kelkar, A.G. Parallel Alternating Direction Primal-Dual (PADPD) Algorithm for Multi-Block Centralized Optimization. J. Comput. Inf. Sci. Eng. 2023, 23, 051010.
21. Brodie, J.; Daubechies, I.; DeMol, C.; Giannone, D.; Loris, I. Sparse and stable Markowitz portfolios. Proc. Natl. Acad. Sci. USA 2009, 30, 12267–12272.
22. Corsaro, S.; De Simone, V. Adaptive l1-regularization for short-selling control in portfolio selection. Comput. Optim. Appl. 2019, 72, 457–478.
23. Corsaro, S.; De Simone, V.; Marino, Z.; Perla, F. L1-regularization for multi-period portfolio selection. Ann. Oper. Res. 2020, 294, 75–86.
24. Corsaro, S.; De Simone, V.; Marino, Z.; Scognamiglio, S. l1-Regularization in Portfolio Selection with Machine Learning. Mathematics 2022, 10, 540.
25. Corsaro, S.; De Simone, V.; Marino, Z. Fused Lasso approach in portfolio selection. Ann. Oper. Res. 2021, 299, 47–59.
26. Corsaro, S.; De Simone, V.; Marino, Z. Split Bregman iteration for multi-period mean variance portfolio optimization. Appl. Math. Comput. 2021, 392, 125715.
Figure 1. Relative residual (left) and $\tau_k$ history (right) against the number of iterations for the 4 consensus logistic regression instances. From top to bottom: a9a, cod-rna, ijcnn1, phishing.
Table 1. Number of training pairs M and problem dimension n for each dataset.

Name        M        n
a9a         22,793   123
cod-rna     41,675   8
ijcnn1      34,994   22
phishing    7739     68
Table 2. Performance for portfolio FF48 with 10-year simulation.

Method                    Iterations   Density   Ratio
τ_0 = 0.1
Vanilla ADMM                     846     0.096    2.064
Adaptive ADMM [7]                321     0.096    2.064
Adaptive ADMM-BB1                190     0.096    2.064
Adaptive ADMM-BB2                458     0.096    2.064
Adaptive ADMM-ABB_min            509     0.096    2.064
τ_0 = 0.5
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                488     0.096    2.064
Adaptive ADMM-BB1                340     0.096    2.064
Adaptive ADMM-BB2                644     0.096    2.064
Adaptive ADMM-ABB_min            665     0.096    2.064
τ_0 = 1
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                283     0.096    2.064
Adaptive ADMM-BB1                232     0.096    2.064
Adaptive ADMM-BB2                617     0.096    2.064
Adaptive ADMM-ABB_min            618     0.096    2.064
Table 3. Performance for portfolio FF48 with 20-year simulation.

Method                    Iterations   Density   Ratio
τ_0 = 0.1
Vanilla ADMM                     871     0.111    2.430
Adaptive ADMM [7]                518     0.111    2.430
Adaptive ADMM-BB1                470     0.111    2.430
Adaptive ADMM-BB2               1698     0.111    2.430
Adaptive ADMM-ABB_min            298     0.111    2.430
τ_0 = 0.5
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                454     0.112    2.430
Adaptive ADMM-BB1                450     0.111    2.430
Adaptive ADMM-BB2               1291     0.111    2.430
Adaptive ADMM-ABB_min           1049     0.111    2.430
τ_0 = 1
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                454     0.111    2.430
Adaptive ADMM-BB1                361     0.111    2.430
Adaptive ADMM-BB2               1252     0.111    2.430
Adaptive ADMM-ABB_min            775     0.111    2.430
Table 4. Performance for portfolio FF48 with 30-year simulation.

Method                    Iterations   Density   Ratio
τ_0 = 0.1
Vanilla ADMM                    1700     0.147    5.134
Adaptive ADMM [7]                263     0.147    5.134
Adaptive ADMM-BB1                331     0.147    5.134
Adaptive ADMM-BB2               1806     0.147    5.134
Adaptive ADMM-ABB_min            281     0.147    5.134
τ_0 = 0.5
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                270     0.147    5.134
Adaptive ADMM-BB1                327     0.147    5.134
Adaptive ADMM-BB2                324     0.147    5.134
Adaptive ADMM-ABB_min            275     0.146    5.134
τ_0 = 1
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                252     0.147    5.134
Adaptive ADMM-BB1                269     0.147    5.134
Adaptive ADMM-BB2                458     0.147    5.134
Adaptive ADMM-ABB_min            595     0.146    5.134
Table 5. Performance for portfolio NASDAQ100 with 10-year simulation.

Method                    Iterations   Density   Ratio
τ_0 = 0.1
Vanilla ADMM                    1421     0.023    2.005
Adaptive ADMM [7]               1140     0.023    2.005
Adaptive ADMM-BB1                906     0.023    2.005
Adaptive ADMM-BB2               3000         —        —
Adaptive ADMM-ABB_min           3000         —        —
τ_0 = 0.5
Vanilla ADMM                    1984     0.023    2.005
Adaptive ADMM [7]               1368     0.023    2.005
Adaptive ADMM-BB1                947     0.023    2.005
Adaptive ADMM-BB2               3000         —        —
Adaptive ADMM-ABB_min           3000         —        —
τ_0 = 1
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]               1789     0.023    2.005
Adaptive ADMM-BB1               1307     0.023    2.005
Adaptive ADMM-BB2               3000         —        —
Adaptive ADMM-ABB_min           3000         —        —
Table 6. Performance for portfolio Dow Jones with 10-year simulation.

Method                    Iterations   Density   Ratio
τ_0 = 0.1
Vanilla ADMM                    1211     0.065    1.370
Adaptive ADMM [7]                505     0.065    1.370
Adaptive ADMM-BB1                510     0.065    1.370
Adaptive ADMM-BB2                645     0.065    1.370
Adaptive ADMM-ABB_min           1573     0.065    1.370
τ_0 = 0.5
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                997     0.065    1.370
Adaptive ADMM-BB1                691     0.065    1.370
Adaptive ADMM-BB2               1077     0.065    1.370
Adaptive ADMM-ABB_min           1161     0.065    1.370
τ_0 = 1
Vanilla ADMM                    3000         —        —
Adaptive ADMM [7]                511     0.065    1.370
Adaptive ADMM-BB1                678     0.065    1.370
Adaptive ADMM-BB2                910     0.065    1.370
Adaptive ADMM-ABB_min           1061     0.065    1.370