Article

On the Convergence Properties of a Stochastic Trust-Region Method with Inexact Restoration

by Stefania Bellavia 1,2,†, Benedetta Morini 1,2,† and Simone Rebegoldi 1,2,*,†
1 Dipartimento di Ingegneria Industriale, Università degli studi di Firenze, Viale G.B. Morgagni 40, 50134 Firenze, Italy
2 INDAM-GNCS Research Group, P.le Aldo Moro 5, 00185 Roma, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Axioms 2023, 12(1), 38; https://doi.org/10.3390/axioms12010038
Submission received: 29 October 2022 / Revised: 19 December 2022 / Accepted: 24 December 2022 / Published: 28 December 2022
(This article belongs to the Section Mathematical Analysis)

Abstract:
We study the convergence properties of SIRTR, a stochastic inexact restoration trust-region method suited for the minimization of a finite sum of continuously differentiable functions. This method combines the trust-region methodology with random function and gradient estimates formed by subsampling. Unlike other existing schemes, it forces the decrease of a merit function by combining the function approximation with an infeasibility term, the latter of which measures the distance of the current sample size from its maximum value. In a previous work, the expected iteration complexity to satisfy an approximate first-order optimality condition was given. Here, we elaborate on the convergence analysis of SIRTR and prove its convergence in probability under suitable accuracy requirements on random function and gradient estimates. Furthermore, we report the numerical results obtained on some nonconvex classification test problems, discussing the impact of the probabilistic requirements on the selection of the sample sizes.
MSC:
65K05; 90C30; 90C15

1. Introduction

The solutions of large-scale finite-sum optimization problems have become essential in several machine learning tasks, including binary or multinomial classification, regression, clustering, and anomaly detection [1,2]. Indeed, the training of the models employed in such tasks is often performed by solving the optimization problem
$$\min_{x\in\mathbb{R}^n} f_N(x) = \frac{1}{N}\sum_{i=1}^{N}\phi_i(x), \tag{1}$$
where $N$ is the size of the available data set and the functions $\phi_i: \mathbb{R}^n \to \mathbb{R}$ are continuously differentiable for all $i = 1,\dots,N$. As a result, the efficient solution of a machine learning problem calls for efficient numerical algorithms for (1).
When the data set is extremely large, the evaluation of the objective function $f_N$ and its derivatives may be computationally demanding, making deterministic optimization methods inadequate for solving (1). A common strategy consists of approximating both the function and its derivatives using a small number of loss functions $\phi_i$ sampled randomly, which makes stochastic optimization methods the preferred choice [3,4,5]. A major issue is the sensitivity of most stochastic algorithms to their parameters, such as the learning rate or the sample sizes used for building the function and gradient approximations, which usually need to be tuned by trial and error before the algorithm exhibits acceptable performance. A possible remedy to burdensome tuning is to employ adaptive optimization methods, which compute the parameters according to appropriate globalization strategies [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Most of these methods impose probabilistic accuracy requirements on the function and gradient estimates in order to establish either iteration complexity bounds in expectation [7,8,9,10,11,13,15,19,21,22] or the convergence in probability of the iterates [6,12,16,17,18,20]. In turn, these requirements are reflected in the choice of the sample sizes, which need to grow progressively as the iterations proceed, resulting in an increasing computational cost per iteration.
In [11], the authors proposed the so-called stochastic inexact restoration trust-region (SIRTR) method for solving (1). SIRTR employs subsampled function and gradient estimates and combines the classical trust-region scheme with the inexact restoration method for constrained optimization problems [23,24,25]. This combined strategy involves the reformulation of (1) as an optimization problem in the two unknown variables $(x, M)$, where $x$ is the object to be recovered and $M$ is the sample size of the function estimate, upon which the constraint $M = N$ is imposed. Based on this reformulation of (1), the method acts on the two variables in a modular way: first, it selects the sample size with a deterministic rule aimed at improving feasibility with respect to the constraint $M = N$; then, it accepts or rejects the inexact trust-region step by improving optimality with respect to a suitable merit function. SIRTR has shown good numerical performance on a series of classification and regression test problems, as its inexact restoration strategy drastically reduces the computational burden associated with the selection of the algorithmic parameters. From a theoretical viewpoint, the authors in [11] provided an upper bound on the expected number of iterations required to reach a near-stationary point under appropriate probabilistic accuracy requirements on the random estimators; remarkably, such requirements are less stringent than others employed in the literature. However, the convergence in probability of SIRTR remained unproved, thus leaving open the question of whether the gradient of the objective function in (1) converges to zero with probability one. A positive answer to this question would be an important theoretical confirmation of the numerical stability of the method.
In this paper, we improve on the existing theoretical analysis of SIRTR, showing that its iterates drive the gradient to zero with probability one. The results are obtained by combining the theoretical properties of SIRTR with some tools from martingale theory, as is typically done in the convergence analysis of adaptive stochastic methods [6,16,18]. Furthermore, we show the numerical results obtained by applying SIRTR to nonconvex binary classification problems, discussing the impact of the probabilistic accuracy requirements on the performance of the method.
The paper is structured as follows. In Section 2, we briefly outline the method and its main steps. In Section 3, we perform the convergence analysis of the method, showing that the gradient of the objective function evaluated at its iterates converges to zero with probability one. In Section 4, we provide a numerical illustration on a binary classification test problem. Finally, we report the conclusions and future work in Section 5.
Notations: Throughout the paper, $\mathbb{R}$ is the set of real numbers, whereas the symbol $\|\cdot\|$ denotes the standard Euclidean norm on $\mathbb{R}^n$. We denote by $(\Omega, \mathcal{A}, P)$ a probability space, where $\Omega$ is the sample space, $\mathcal{A} \subseteq \mathcal{P}(\Omega)$ is the $\sigma$-algebra of events, and $P: \mathcal{A} \to [0,1]$ is the probability function. Given an event $A \in \mathcal{A}$, the symbol $P(A)$ stands for the probability of the event $A$, and $\mathbb{1}_A: \Omega \to \{0,1\}$ denotes the indicator function of $A$, i.e., the function such that $\mathbb{1}_A(\omega) = 1$ if $\omega \in A$ and $\mathbb{1}_A(\omega) = 0$ otherwise. Given a random variable $X: \Omega \to \mathbb{R}$, we denote by $E(X)$ the expected value of $X$. Given $n$ random variables $X_1,\dots,X_n$, the notation $\sigma(X_1,\dots,X_n)$ stands for the $\sigma$-algebra generated by $X_1,\dots,X_n$.

2. The SIRTR Method

Here, we present the stochastic inexact restoration trust-region (SIRTR) method, originally proposed in [11]. SIRTR is a first-order trust-region method with subsampled function and gradient estimates, which combines the trust-region methodology with the inexact restoration method for constrained optimization [25]. In order to provide a detailed description of SIRTR, we reformulate (1) as the constrained problem
$$\min_{x\in\mathbb{R}^n} f_M(x) = \frac{1}{M}\sum_{i\in I_M}\phi_i(x), \quad \text{s.t.} \quad M = N, \tag{2}$$
where $I_M \subseteq \{1,\dots,N\}$ is a sample set of cardinality $|I_M| = M$. To measure infeasibility with respect to the constraint $M = N$, we introduce a function $h$ that measures the distance of $M \in \{1,\dots,N\}$ from $N$. The function $h$ is assumed to satisfy the following properties.
Assumption 1.
The function $h: \{1,2,\dots,N\} \to \mathbb{R}$ is monotonically decreasing and satisfies $h(1) > 0$, $h(N) = 0$.
From Assumption 1, it follows that there exist positive constants $\underline{h}$ and $\bar{h}$ such that
$$\underline{h} \le h(M) \le \bar{h}, \quad \text{for } 1 \le M \le N-1. \tag{3}$$
An example of a function $h: \{1,\dots,N\} \to \mathbb{R}$ satisfying Assumption 1 is $h(M) = \frac{N-M}{N}$.
SIRTR is a stochastic variant of the classical first-order trust-region method, which accepts the trial point according to the decrease of a convex combination of the function estimate $f_M$ and the function $h$. We report SIRTR in Algorithm 1.
At the beginning of each iteration $k \ge 0$, we have at our disposal the iterate $x_k$, the trust-region radius $\delta_k$, the sample size $N_k \in \{1,\dots,N\}$, the penalty parameter $\theta_k$, and the flag iflag, where iflag = succ if the previous iteration was successful, in the sense specified below, and iflag = unsucc otherwise. Steps 1–5 then perform the following tasks.
  • In Step 1, if iflag = succ, we reduce the current value $h(N_k)$ of the infeasibility measure by finding some $\tilde{N}_{k+1} \in \{1,\dots,N\}$ satisfying $h(\tilde{N}_{k+1}) \le r\,h(N_k)$ with $r \in (0,1)$. On the other hand, if iflag = unsucc, $\tilde{N}_{k+1}$ is carried over from one iteration to the next, i.e., we set $\tilde{N}_{k+1} = \tilde{N}_k$. Note that $\tilde{N}_{k+1} = N$ if $N_k = N$.
  • Step 2 determines a trial sample size $N^t_{k+1}$ that satisfies $h(N^t_{k+1}) \le h(\tilde{N}_{k+1}) + \mu\,\delta_k^2$ with $\mu > 0$ and is used to form the random model. In principle, we could fix $N^t_{k+1} = \tilde{N}_{k+1}$, but selecting a smaller sample size, when possible, yields computational savings in the subsequent step. The relation between $N^t_{k+1}$ and $\tilde{N}_{k+1}$ depends on $\delta_k$: small values of $\delta_k$ give $N^t_{k+1} = \tilde{N}_{k+1}$; otherwise, $N^t_{k+1}$ is allowed to be smaller than $\tilde{N}_{k+1}$.
  • Step 3 forms the random model $m_k(p)$ and the trial step $p_k$. The linear model is given by $m_k(p) = f_{N^t_{k+1}}(x_k) + g_k^T p$, where
$$f_{N^t_{k+1}}(x_k) = \frac{1}{N^t_{k+1}}\sum_{i\in I_{N^t_{k+1}}} \phi_i(x_k), \tag{4}$$
and
$$g_k = \frac{1}{N_{k+1,g}}\sum_{i\in I_{N_{k+1,g}}} \nabla\phi_i(x_k), \tag{5}$$
with $I_{N_{k+1,g}} \subseteq \{1,\dots,N\}$ of cardinality $|I_{N_{k+1,g}}| = N_{k+1,g}$.
Minimizing $m_k$ over the ball of center $0$ and radius $\delta_k$ gives the trial step
$$p_k = \operatorname*{argmin}_{\|p\|\le\delta_k} m_k(p) = -\delta_k\,\frac{g_k}{\|g_k\|}.$$
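To fix ideas, the following Python sketch assembles the subsampled estimates (4) and (5) and the resulting trial step. It is a minimal illustration, not the authors' MATLAB implementation; the callables phi and grad_phi, which evaluate a single loss $\phi_i$ and its gradient, are assumed to be supplied by the user.

```python
import numpy as np

def random_model_and_step(x_k, delta_k, idx_f, idx_g, phi, grad_phi):
    """Step 3 of SIRTR: subsampled estimates (4)-(5) and the trial step.

    idx_f and idx_g play the roles of the index sets I_{N^t_{k+1}} and
    I_{N_{k+1,g}}; phi(i, x) and grad_phi(i, x) are assumed helpers.
    """
    f_est = np.mean([phi(i, x_k) for i in idx_f])               # estimate (4)
    g_k = np.mean([grad_phi(i, x_k) for i in idx_g], axis=0)    # estimate (5)
    p_k = -delta_k * g_k / np.linalg.norm(g_k)                  # minimizer of m_k on the ball
    m_k = f_est + g_k @ p_k                                     # m_k(p_k) = f_est - delta_k * ||g_k||
    return f_est, g_k, p_k, m_k
```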
Algorithm 1 Stochastic Inexact Restoration Trust-Region (SIRTR)
Given $x_0 \in \mathbb{R}^n$, $N_0 \in \{1,\dots,N\}$, $\eta_1 \in (0,1)$, $\theta_0 \in (0,1)$, $r \in (0,1)$, $\gamma > 1$, $\mu > 0$, $\eta_2 > 0$, $0 < \delta_0 < \delta_{\max}$.
0. Set $k = 0$, iflag = succ.
1. Reference sample size.
   If iflag = succ, find $\tilde{N}_{k+1}$ such that $N_k \le \tilde{N}_{k+1} \le N$ and
   $$h(\tilde{N}_{k+1}) \le r\,h(N_k); \tag{6}$$
   Else, set $\tilde{N}_{k+1} = \tilde{N}_k$.
2. Trial sample size.
   If $N_k = N$, set $N^t_{k+1} = N$;
   Else, find $N^t_{k+1}$ such that
   $$h(N^t_{k+1}) \le h(\tilde{N}_{k+1}) + \mu\,\delta_k^2. \tag{7}$$
3. Trust-region model.
   Choose $I_{N^t_{k+1}} \subseteq \{1,\dots,N\}$ such that $|I_{N^t_{k+1}}| = N^t_{k+1}$.
   Choose $N_{k+1,g}$ and $I_{N_{k+1,g}} \subseteq \{1,\dots,N\}$ such that $|I_{N_{k+1,g}}| = N_{k+1,g}$.
   Compute $g_k$ as in (5), and set $p_k = -\delta_k\,g_k/\|g_k\|$.
   Compute $f_{N^t_{k+1}}(x_k)$ as in (4), and set $m_k(p_k) = f_{N^t_{k+1}}(x_k) + g_k^T p_k$.
4. Penalty parameter.
   If $\mathrm{Pred}_k(\theta_k) \ge \eta_1\,(h(N_k) - h(\tilde{N}_{k+1}))$, set $\theta_{k+1} = \theta_k$;
   Else, set
   $$\theta_{k+1} = \frac{(1-\eta_1)\,(h(N_k) - h(\tilde{N}_{k+1}))}{m_k(p_k) - f_{N_k}(x_k) + h(N_k) - h(\tilde{N}_{k+1})}. \tag{8}$$
5. Acceptance test.
   If $\mathrm{Ared}_k(x_k + p_k, \theta_{k+1}) \ge \eta_1\,\mathrm{Pred}_k(\theta_{k+1})$ and $\|g_k\| \ge \eta_2\,\delta_k$ (success), define
   $$x_{k+1} = x_k + p_k, \qquad \delta_{k+1} = \min\{\gamma\,\delta_k,\,\delta_{\max}\}; \tag{9}$$
   set $N_{k+1} = N^t_{k+1}$, $k = k+1$, iflag = succ, and go to Step 1.
   Else (unsuccess), define
   $$x_{k+1} = x_k, \qquad \delta_{k+1} = \frac{\delta_k}{\gamma}; \tag{10}$$
   set $N_{k+1} = N_k$, $k = k+1$, iflag = unsucc, and go to Step 1.
  • In Step 4, we compute the penalty parameter $\theta_{k+1} \in (0,1)$ that governs the predicted reduction $\mathrm{Pred}_k$ in the function and infeasibility measure, which we define as
$$\mathrm{Pred}_k(\theta) = \theta\,(f_{N_k}(x_k) - m_k(p_k)) + (1-\theta)\,(h(N_k) - h(\tilde{N}_{k+1})). \tag{11}$$
If $\theta = \theta_k$ satisfies
$$\mathrm{Pred}_k(\theta) \ge \eta_1\,(h(N_k) - h(\tilde{N}_{k+1})), \tag{12}$$
then we set $\theta_{k+1} = \theta_k$; otherwise, we compute $\theta_{k+1}$ as the largest value for which Inequality (12) is satisfied, which takes the explicit form given in (8).
  • In Step 5, we establish whether to accept (success) or reject (unsuccess) the trial point $x_k + p_k$. The actual reduction $\mathrm{Ared}_k$ at a point $\hat{x}$ is defined as
$$\mathrm{Ared}_k(\hat{x}, \theta) = \theta\,(f_{N_k}(x_k) - f_{N^t_{k+1}}(\hat{x})) + (1-\theta)\,(h(N_k) - h(N^t_{k+1})), \tag{13}$$
and we declare a successful iteration whenever the following conditions are both met:
$$\mathrm{Ared}_k(x_k + p_k, \theta_{k+1}) \ge \eta_1\,\mathrm{Pred}_k(\theta_{k+1}), \tag{14}$$
$$\|g_k\| \ge \eta_2\,\delta_k. \tag{15}$$
Condition (14) reduces to the standard acceptance criterion of deterministic trust-region methods when $N_k = \tilde{N}_{k+1} = N^t_{k+1} = N$. If both conditions are satisfied, we accept the step $p_k$ and set $x_{k+1} = x_k + p_k$, increase the trust-region radius according to the update rule (9), and set $N_{k+1} = N^t_{k+1}$, iflag = succ; otherwise, we retain the previous iterate, i.e., $x_{k+1} = x_k$, reduce the trust-region radius according to (10), and set $N_{k+1} = N_k$, iflag = unsucc. Steps 4 and 5 are summarized in the Python sketch following this list.
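As anticipated above, Steps 4 and 5 admit a compact implementation. The sketch below is illustrative Python rather than the authors' MATLAB code: it solves $\mathrm{Pred}_k(\theta) = \eta_1(h(N_k) - h(\tilde{N}_{k+1}))$ for $\theta$, which gives exactly Formula (8), and then runs the acceptance test (14)–(15); all argument names are placeholders for the quantities defined in the text.

```python
def penalty_and_acceptance(f_k, m_k, f_trial, h_k, h_ref, h_trial,
                           g_norm, delta_k, theta_k, eta1, eta2):
    """Steps 4-5 of Algorithm 1: penalty update (8) and acceptance test (14)-(15).

    f_k = f_{N_k}(x_k), m_k = m_k(p_k), f_trial = f_{N^t_{k+1}}(x_k + p_k),
    h_k = h(N_k), h_ref = h(N~_{k+1}), h_trial = h(N^t_{k+1}).
    """
    dh = h_k - h_ref                                    # h(N_k) - h(N~_{k+1}) >= 0 by (6)
    pred = lambda th: th * (f_k - m_k) + (1.0 - th) * dh    # Pred_k(theta), (11)
    if pred(theta_k) >= eta1 * dh:
        theta_next = theta_k                            # test (12) already holds
    else:
        # largest theta with Pred_k(theta) = eta1 * dh, i.e., Formula (8)
        theta_next = (1.0 - eta1) * dh / (m_k - f_k + dh)
    ared = theta_next * (f_k - f_trial) + (1.0 - theta_next) * (h_k - h_trial)  # (13)
    success = ared >= eta1 * pred(theta_next) and g_norm >= eta2 * delta_k      # (14)-(15)
    return theta_next, success
```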

3. Convergence Analysis

In this section, we are interested in the convergence properties of Algorithm 1. To this aim, we note that the function estimates $f_{N^t_{k+1}}(x_k)$ in (4) and the gradient estimates $g_k$ in (5) are random quantities. Consequently, Algorithm 1 generates a random process; that is, the iterates $X_k$, the trust-region radii $\Delta_k$, the gradient estimates $G_k$, the function estimates $f_{N^t_{k+1}}(X_k)$, and the values $\Psi_k$ of the Lyapunov function $\Psi$ in (21) at iteration $k$ are to be considered random variables, with realizations denoted by $x_k$, $\delta_k$, $g_k$, $f_{N^t_{k+1}}(x_k)$, and $\psi_k$.
Our aim is to show the convergence in probability of the iterates generated by Algorithm 1, in the sense that
$$P\left(\lim_{k\to\infty} \|\nabla f_N(X_k)\| = 0\right) = 1, \tag{16}$$
i.e., the event $\lim_{k\to\infty} \|\nabla f_N(X_k)\| = 0$ holds almost surely. We note that the authors in [11] derived a bound on the expected number of iterations required by Algorithm 1 to reach a desired accuracy in the gradient norm but did not show convergence results of Type (16).

3.1. Preliminary Results

We recall some technical preliminary results that were obtained for Algorithm 1 in [11]. First, we impose some basic assumptions on the functions in Problem (1).
Assumption 2.
(i) 
Each function $\phi_i: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable for $i = 1,\dots,N$.
(ii) 
The functions $f_M: \mathbb{R}^n \to \mathbb{R}$, $M = 1,\dots,N$, are bounded from below on $\mathbb{R}^n$, i.e., there exists $f_{low} \in \mathbb{R}$ such that
$$f_M(x) \ge f_{low}, \quad 1 \le M \le N, \; \forall x \in \mathbb{R}^n.$$
(iii) 
The functions $f_M: \mathbb{R}^n \to \mathbb{R}$, $M = 1,\dots,N$, are bounded from above on a subset $\Omega \subseteq \mathbb{R}^n$, i.e., there exists $f_{up} \in \mathbb{R}$ such that
$$f_M(x) \le f_{up}, \quad 1 \le M \le N, \; \forall x \in \Omega.$$
Furthermore, the iterates $\{x_k\}_{k\in\mathbb{N}}$ generated by Algorithm 1 are contained in $\Omega$.
Combining Step 4 of Algorithm 1 with Bound (3) and Assumption 2(iii), it is possible to prove that, for any realization of the algorithm, the sequence $\{\theta_k\}_{k\in\mathbb{N}}$ is bounded away from zero.
Lemma 1 
([11], Lemma 2). Let Assumptions 1 and 2 hold and consider a particular realization of Algorithm 1. Let $\kappa_\phi > 0$ be defined as
$$\kappa_\phi = \max\{|f_{low}|, |f_{up}|\}. \tag{17}$$
Then, $\{\theta_k\}_{k\in\mathbb{N}}$ is a positive, non-increasing sequence such that
$$\theta_k \ge \underline{\theta} = \min\left\{\theta_0,\; \frac{(1-\eta_1)(1-r)\,\underline{h}}{2\kappa_\phi + \bar{h}}\right\}, \quad \forall k \ge 0. \tag{18}$$
Furthermore, Condition (12) holds with $\theta = \theta_{k+1}$.
Since the acceptance test in Algorithm 1 employs function and gradient estimates, we cannot expect the objective function $f_N$ to decrease from one iteration to the next; however, the authors in [11] showed that an appropriate Lyapunov function $\Psi$ is reduced at each iteration. This Lyapunov function is defined as
$$\Psi(x, M, \theta, \delta) = v\left[\theta f_M(x) + (1-\theta)\,h(M) + \theta\,\Sigma\right] + (1-v)\,\delta^2, \tag{19}$$
where $v \in (0,1)$ and $\Sigma \in \mathbb{R}$ are any constants that satisfy
$$f_{N_k}(x) - h(N_k) + \Sigma \ge 0, \quad \forall x \in \Omega, \; k \ge 0; \tag{20}$$
such a constant $\Sigma$ exists thanks to Bound (3) and Assumption 2(ii). For all $k \ge 0$, we denote the values of $\Psi$ along the iterates of Algorithm 1 by
$$\psi_k = \Psi(x_k, N_k, \theta_k, \delta_k), \quad k \ge 0. \tag{21}$$
Thanks to (20) and the non-negativity of $h$ (see Assumption 1), we can easily deduce that the sequence $\{\psi_k\}_{k\in\mathbb{N}}$ is non-negative; indeed,
$$\psi_k \ge v\left[\theta_k f_{N_k}(x_k) + (1-\theta_k)\,h(N_k) + \theta_k\,(h(N_k) - f_{N_k}(x_k))\right] = v\,h(N_k) \ge 0. \tag{22}$$
Furthermore, the difference between two successive values $\psi_{k+1}$ and $\psi_k$ can be rewritten as
$$\begin{aligned} \psi_{k+1} - \psi_k = {}& v\left[\theta_{k+1}\,(f_{N_{k+1}}(x_{k+1}) - f_{N_k}(x_k)) + (1-\theta_{k+1})\,(h(N_{k+1}) - h(N_k))\right] \\ &+ v\,(\theta_{k+1} - \theta_k)\,(f_{N_k}(x_k) - h(N_k) + \Sigma) + (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \end{aligned} \tag{23}$$
If $k$ is a successful iteration, then $N_{k+1} = N^t_{k+1}$. Recalling (20) and the fact that the sequence $\{\theta_k\}_{k\in\mathbb{N}}$ is monotone non-increasing (see Lemma 1), Equality (23) yields
$$\psi_{k+1} - \psi_k \le -v\,\mathrm{Ared}_k(x_{k+1}, \theta_{k+1}) + (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \tag{24}$$
Otherwise, Algorithm 1 sets $x_{k+1} = x_k$ and $N_{k+1} = N_k$. Inserting these updates into (23), together with (20) and the fact that $\{\theta_k\}_{k\in\mathbb{N}}$ is non-increasing, we obtain
$$\psi_{k+1} - \psi_k \le (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \tag{25}$$
Using (24) and (25) in combination with Step 5 of Algorithm 1, we can prove the following results.
Theorem 1 
([11], Theorem 1). Let Assumptions 1–2 hold and consider a particular realization of Algorithm 1. In (19), choose $v \in (\bar{v}, 1)$, where $\bar{v}$ is defined by
$$\bar{v} = \max\left\{\frac{(\gamma^2 - 1)\,\delta_{\max}^2}{\eta_1^2\,(1-r)\,\underline{h} + (\gamma^2 - 1)\,\delta_{\max}^2},\; \frac{\gamma^2 - 1}{\eta_1\,\eta_2\,\underline{\theta} + \gamma^2 - 1}\right\}. \tag{26}$$
Then, there exists a constant $\sigma = \sigma(v) > 0$ such that
$$\psi_{k+1} - \psi_k \le -\sigma\,\delta_k^2, \quad \text{for all } k \ge 0; \tag{27}$$
hence, the sequence $\{\delta_k\}$ in Algorithm 1 satisfies
$$\lim_{k\to\infty} \delta_k = 0. \tag{28}$$
We now introduce a Lipschitz continuity assumption on the gradients of the functions ϕ i appearing in (1).
Assumption 3.
Each gradient $\nabla\phi_i$ is $L_i$-Lipschitz continuous for $i = 1,\dots,N$. We use the notation $L = \frac{1}{2}\max_{1\le i\le N} L_i$.
The gradient estimates are bounded under Assumptions 2 and 3, as stated in the following lemma.
Lemma 2 
([11], Lemma 5). Let Assumptions 2 and 3 hold. Then, there exists $g_{\max} > 0$ such that, for any realization of Algorithm 1,
$$\|g_k\| \le g_{\max}, \quad \forall k \ge 0, \tag{29}$$
where $g_{\max} = \sqrt{8L\kappa_\phi}$ and $\kappa_\phi$ is given in (17).
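The constant $g_{\max}$ is consistent with the standard bound on gradients of smooth functions that are bounded from below; a minimal sketch of the argument, assuming each $f_M$ is $2L$-smooth (which follows from Assumption 3 and $L = \frac{1}{2}\max_i L_i$) and using Assumption 2 at the iterates (which lie in $\Omega$), is
$$\|\nabla f_M(x_k)\|^2 \le 2\,(2L)\left(f_M(x_k) - f_{low}\right) \le 4L\left(f_{up} - f_{low}\right) \le 8L\,\kappa_\phi,$$
so that $\|g_k\| \le \sqrt{8L\kappa_\phi} = g_{\max}$.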
Let us introduce the following events:
$$G_{k,1} = \left\{\|\nabla f_N(X_k) - G_k\| \le \nu\,\Delta_k\right\}, \qquad G_{k,2} = \left\{\left|f_N(X_k) - f_{N^t_{k+1}}(X_k)\right| \le \nu\,\Delta_k\right\}, \tag{30}$$
where $\nu$ is a positive parameter. Using a terminology similar to the one employed in [22], the iteration $k$ is said to be true if the events $G_{k,1}$ and $G_{k,2}$ both occur.
The next lemma shows that iteration $k$ is successful whenever it is true and the trust-region radius $\delta_k$ is sufficiently small. This result is crucial for the analysis in the next subsection.
Lemma 3 
([11], Lemma 6). Let Assumptions 1–3 hold and set
$$\eta_3 = \frac{\delta_{\max}\,g_{\max}\left(\theta_0\,(2\nu + L) + (1-\underline{\theta})\,\mu\right)}{\eta_1\,(1-\eta_1)\,(1-r)\,\underline{h}}.$$
Suppose that, for a particular realization of Algorithm 1, iteration $k$ is true and the following condition holds:
$$\delta_k < \min\left\{\frac{\|g_k\|}{\eta_2},\; \frac{\|g_k\|}{\eta_3},\; \frac{(1-\eta_1)\,\|g_k\|}{2\nu + L}\right\}. \tag{31}$$
Then, iteration k is successful.

3.2. Novel Convergence Results

Here, we derive two novel convergence results in probability for Algorithm 1. The results are provided under the assumption that the random variables $G_k$ and $f_{N^t_{k+1}}(X_k)$ are sufficiently accurate estimators of the true gradient and function value at $X_k$, in the probabilistic sense specified below.
Assumption 4.
Let $\mathcal{F}_{k-1} = \sigma\left(G_0,\dots,G_{k-1},\, f_{N_1^t}(X_0),\dots,f_{N_k^t}(X_{k-1})\right)$. Then, the events $G_{k,1}$, $G_{k,2}$ occur with sufficiently high probability conditioned on $\mathcal{F}_{k-1}$, and the estimators $G_k$ and $f_{N^t_{k+1}}(X_k)$ are conditionally independent random variables given $\mathcal{F}_{k-1}$, i.e.,
$$P(G_{k,1} \mid \mathcal{F}_{k-1}) = \pi_1, \quad P(G_{k,2} \mid \mathcal{F}_{k-1}) = \pi_2, \quad \text{and} \quad \pi_3 = \pi_1\pi_2 > \frac{1}{2}. \tag{32}$$
First, we provide a liminf-type convergence result for SIRTR, which shows that, with probability one, the gradient of the objective function converges to zero along a subsequence of the iterates.
Theorem 2.
Suppose that Assumptions 1–4 hold. Then, there holds
$$P\left(\liminf_{k\to\infty} \|\nabla f_N(X_k)\| = 0\right) = 1.$$
Proof. 
The proof parallels that of ([16], Theorem 4.16). By contradiction, assume that there exists $\epsilon > 0$ such that the event
$$\|\nabla f_N(X_k)\| \ge \epsilon, \quad \forall k \ge 0, \tag{33}$$
holds with positive probability. Then, let $\{x_k\}_{k\in\mathbb{N}}$ be a realization of $\{X_k\}_{k\in\mathbb{N}}$ such that $\|\nabla f_N(x_k)\| \ge \epsilon$ for all $k \ge 0$, and let $\{\delta_k\}_{k\in\mathbb{N}}$ be the corresponding realization of $\{\Delta_k\}_{k\in\mathbb{N}}$. From Theorem 1, we know that $\lim_{k\to\infty} \delta_k = 0$; therefore, there exists $\bar{k}$ such that
$$\delta_k < b = \min\left\{\frac{\epsilon}{2\nu},\; \frac{\epsilon}{2\eta_2},\; \frac{\epsilon}{2\eta_3},\; \frac{\epsilon\,(1-\eta_1)}{2\,(2\nu+L)},\; \frac{\delta_{\max}}{\gamma}\right\}, \quad \forall k \ge \bar{k}. \tag{34}$$
Consider the random variable $R_k$ with realizations given by
$$r_k = \log_\gamma\left(\frac{\delta_k}{b}\right), \quad k \ge 0. \tag{35}$$
Note that $r_k$ satisfies the following properties.
(i) If $k \ge \bar{k}$, then $r_k < 0$; this is a consequence of (34).
(ii) If $k$ is a true iteration and $k \ge \bar{k}$, then $r_{k+1} = r_k + 1$; indeed, since $G_{k,1}$ occurs and $\delta_k < \epsilon/(2\nu)$, it follows that
$$\|g_k - \nabla f_N(x_k)\| \le \nu\,\delta_k < \frac{\epsilon}{2}. \tag{36}$$
Then, $\|\nabla f_N(x_k)\| \ge \epsilon$ yields
$$\|g_k\| \ge \frac{\epsilon}{2}, \tag{37}$$
which, combined with (34), implies that $\delta_k$ satisfies Inequality (31). Thus, Lemma 3 implies that iteration $k$ is successful. Since $\delta_k \le \delta_{\max}/\gamma$ and the $k$-th iteration is successful, it follows from (9) that $\delta_{k+1} = \gamma\,\delta_k$. Hence, $r_{k+1} = r_k + 1$.
(iii) If $k$ is not a true iteration and $k \ge \bar{k}$, then $r_{k+1} \ge r_k - 1$; this is because, since we cannot apply Lemma 3 ($k$ is not true), all we can say about the trust-region radius is that $\delta_{k+1} \ge \delta_k/\gamma$.
Then, defining the $\sigma$-algebra $\mathcal{F}^G_{k-1} = \sigma\left(\mathbb{1}_{G_{0,1}}\cdot\mathbb{1}_{G_{0,2}},\dots,\mathbb{1}_{G_{k-1,1}}\cdot\mathbb{1}_{G_{k-1,2}}\right)$, which is contained in $\mathcal{F}_{k-1}$, it follows from properties (ii)–(iii) and Assumption 4 that
$$E(R_{k+1} \mid \mathcal{F}^G_{k-1}) \ge \pi_1\pi_2\,(R_k + 1) + (1 - \pi_1\pi_2)\,(R_k - 1) \ge R_k,$$
where the second inequality follows from $\pi_1\pi_2 > \frac{1}{2}$. Hence, $\{R_k\}_{k\in\mathbb{N}}$ is a submartingale. We also define the random variable
$$W_k = \sum_{i=0}^{k}\left(2\cdot\mathbb{1}_{G_{i,1}}\cdot\mathbb{1}_{G_{i,2}} - 1\right), \quad k \ge 0.$$
$\{W_k\}_{k\in\mathbb{N}}$ is also a submartingale, as
$$E(W_{k+1} \mid \mathcal{F}^G_{k-1}) = E(W_k \mid \mathcal{F}^G_{k-1}) + 2\,E\left(\mathbb{1}_{G_{k+1,1}}\cdot\mathbb{1}_{G_{k+1,2}} \mid \mathcal{F}^G_{k-1}\right) - 1 = W_k + 2\,P\left(G_{k+1,1} \cap G_{k+1,2} \mid \mathcal{F}^G_{k-1}\right) - 1 \ge W_k,$$
where, again, the last inequality is due to the fact that $\pi_1\pi_2 > \frac{1}{2}$. Since $W_k$ cannot have a finite limit, it follows from ([16], Theorem 4.4) that the event $\limsup_{k\to\infty} W_k = \infty$ holds almost surely. Since $r_k - r_{k_0} \ge w_k - w_{k_0}$ by the definitions of $\{R_k\}_{k\in\mathbb{N}}$ and $\{W_k\}_{k\in\mathbb{N}}$, it follows that $R_k$ has to be positive infinitely often with probability one. However, this contradicts property (i) listed above, which allows us to conclude that (33) cannot occur.    □
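The submartingale argument above ultimately rests on the fact that a $\pm 1$ random walk with upward probability greater than $1/2$ drifts to $+\infty$ and thus exceeds any fixed level infinitely often. A quick Monte Carlo sketch (purely illustrative, with an assumed value $\pi_1\pi_2 = 0.6$) makes this drift visible:

```python
import numpy as np

rng = np.random.default_rng(0)
pi12 = 0.6                       # assumed value of pi_1 * pi_2 > 1/2
# +1 on true iterations (probability pi12), -1 otherwise, as in the definition of W_k
steps = rng.choice([1, -1], size=10_000, p=[pi12, 1 - pi12])
walk = np.cumsum(steps)
print(walk[-1], walk.max())      # positive drift: about (2*pi12 - 1)*k on average
```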
In the following, we show that SIRTR generates iterates at which the gradient of the objective function converges to zero in probability. The next lemma is similar to ([6], Lemma 4.2); however, some crucial modifications are needed here: unlike [6], we must take into account that SIRTR enforces the decrease of the Lyapunov function $\Psi$ defined in (19) rather than of the objective function.
Lemma 4.
Suppose that Assumptions 1–4 hold. Let $\{X_k\}$ and $\{\Delta_k\}$ be the random sequences generated by Algorithm 1. For a fixed $\epsilon > 0$, define the following subset of iteration indices:
$$\{K_i\} = \left\{k \ge 0 : \|\nabla f_N(X_k)\| > \epsilon\right\}.$$
Then,
$$\sum_{k\in\{K_i\}} \Delta_k < \infty$$
holds almost surely.
Proof. 
Let us consider generic realizations $\{x_k\}_{k\in\mathbb{N}}$, $\{g_k\}_{k\in\mathbb{N}}$, $\{\delta_k\}_{k\in\mathbb{N}}$, $\{\theta_k\}_{k\in\mathbb{N}}$, and $\{k_i\}_{i\in\mathbb{N}}$ of Algorithm 1. Furthermore, we let $\{p_i\}$ be the subsequence of $\{k_i\}$ at which the iteration is true, whereas $\{n_i\}$ denotes the complementary subsequence, so that $\{k_i\} = \{p_i\} \cup \{n_i\}$. First, we show that $\sum_{k\in\{p_i\}} \delta_k < \infty$. If $\{p_i\}$ is finite, then there is nothing to prove. Otherwise, since $\lim_{k\to\infty} \delta_k = 0$, there exists $\tilde{k}$ such that $\delta_k < b$ for all $k \ge \tilde{k}$, where $b$ is given in (34). Let us consider any $p_i \ge \tilde{k}$. Since $G_{p_i,1}$ occurs, $\delta_{p_i} < \epsilon/(2\nu)$, and $\|\nabla f_N(x_{p_i})\| > \epsilon$, we can reason as in (36) and (37) to conclude that $\|g_{p_i}\| \ge \epsilon/2$. Combining this lower bound with $\delta_{p_i} < b$, we have that Inequality (31) is satisfied with $k = p_i$. Hence, iteration $p_i$ is successful by Lemma 3, and we have
$$\mathrm{Ared}_k(x_{p_i+1}, \theta_{p_i+1}) \ge \eta_1\,\mathrm{Pred}_k(\theta_{p_i+1}) \ge \eta_1^2\,\big(h(N_{p_i}) - h(\tilde{N}_{p_i+1})\big) \ge \eta_1^2\,(1-r)\,h(N_{p_i}) \ge \eta_1^2\,(1-r)\,\frac{\underline{h}}{g_{\max}\,\delta_{\max}}\,\delta_{p_i}\,\|g_{p_i}\|, \tag{39}$$
where the first inequality is the acceptance test (14), the second follows from Step 4 of the SIRTR algorithm (i.e., Condition (12)), the third follows from (6), and the last follows from (3) and Lemma 2. Now, starting from Inequality (24), which holds for successful iterations, we can derive the following chain of inequalities:
$$\begin{aligned} \psi_{p_i} - \psi_{p_i+1} &\ge v\,\mathrm{Ared}_k(x_{p_i+1}, \theta_{p_i+1}) - (1-v)\,(\delta_{p_i+1}^2 - \delta_{p_i}^2) \\ &\ge v\,\eta_1^2\,(1-r)\,\frac{\underline{h}}{g_{\max}\,\delta_{\max}}\,\delta_{p_i}\,\|g_{p_i}\| - (1-v)\,\frac{\gamma^2-1}{\eta_2}\,\delta_{p_i}\,\|g_{p_i}\| \\ &= \left[v\left(\eta_1^2\,(1-r)\,\frac{\underline{h}}{g_{\max}\,\delta_{\max}} + \frac{\gamma^2-1}{\eta_2}\right) - \frac{\gamma^2-1}{\eta_2}\right]\delta_{p_i}\,\|g_{p_i}\| := c\,\delta_{p_i}\,\|g_{p_i}\|, \end{aligned} \tag{40}$$
where the second inequality follows from (39), (9), and (15). Now, recalling the definition of $\bar{v}$ given in (26), we choose $v$ in (19) such that
$$\max\left\{\bar{v},\; \frac{(\gamma^2-1)\,g_{\max}\,\delta_{\max}}{\eta_1^2\,\eta_2\,(1-r)\,\underline{h} + (\gamma^2-1)\,g_{\max}\,\delta_{\max}}\right\} < v < 1,$$
and, consequently, the constant $c$ in (40) is positive while Theorem 1 remains applicable. Then, plugging $\|g_{p_i}\| \ge \frac{\epsilon}{2}$ into (40) yields
$$\psi_{p_i} - \psi_{p_i+1} \ge \frac{\epsilon\,c}{2}\,\delta_{p_i}.$$
Summing the previous inequality over $k \in \{p_i\}$, $k \ge \tilde{k}$, and noting that $\psi_k - \psi_{k+1} > 0$ for any $k$ (due to (27)), we obtain
$$\sum_{k\in\{p_i\},\,k\ge\tilde{k}} \delta_k \le \frac{2}{\epsilon\,c}\sum_{k\in\{p_i\},\,k\ge\tilde{k}} (\psi_k - \psi_{k+1}) \le \lim_{K\to\infty} \frac{2}{\epsilon\,c}\sum_{k=\tilde{k}}^{K} (\psi_k - \psi_{k+1}) = \lim_{K\to\infty} \frac{2}{\epsilon\,c}\,(\psi_{\tilde{k}} - \psi_{K+1}) \le \frac{2}{\epsilon\,c}\,\psi_{\tilde{k}},$$
where the last inequality follows from (22). Then, we have shown that
$$\sum_{k\in\{p_i\}} \delta_k < \infty.$$
Furthermore, let us introduce the Bernoulli variable $B_k = 2\cdot\mathbb{1}_{G_{k,1}}\cdot\mathbb{1}_{G_{k,2}} - 1$, which takes the value $1$ when iteration $k$ is true and the value $-1$ otherwise. Note that, due to Assumption 4,
$$P\left(B_k = 1 \mid \mathcal{F}_{k-1}\right) > \frac{1}{2}.$$
Moreover, $\{\Delta_k\}$ is a sequence of non-negative, uniformly bounded random variables. Then, we can proceed as in the proof of ([6], Lemma 4.2) and, using ([6], Lemma 4.1), obtain
$$P\left(\sum_{k\in\{P_i\}} \Delta_k < \infty,\; \sum_{k\in\{N_i\}} \Delta_k = \infty\right) = 0.$$
This implies that, almost surely,
$$\sum_{k\in\{N_i\}} \Delta_k < \infty;$$
hence, the thesis follows.    □
As a byproduct of the previous lemma, we obtain the desired convergence result in probability in exactly the same way as in [6].
Theorem 3.
Suppose that Assumptions 1–4 hold. Let $\{X_k\}$ be the sequence of random iterates generated by Algorithm 1. Then, there holds
$$P\left(\lim_{k\to\infty} \|\nabla f_N(X_k)\| = 0\right) = 1.$$
Proof. 
The proof follows exactly as in ([6], Theorem 4.3).    □

4. Numerical Illustration

In this section, we evaluate the numerical performance of Algorithm 1 equipped with the probabilistic accuracy requirements imposed in Assumption 4. Algorithm 1 was implemented in MATLAB R2019a, and the numerical experiments were performed on an 8 GB RAM laptop with an Intel Core i7-4510U 2.00–2.60 GHz processor. The related software can be downloaded from sites.google.com/view/optml-italy-serbia/home/software (accessed on 1 September 2022).
We perform our numerical experiments on a binary classification problem. Denoting by $\{(a_i, b_i)\}_{i=1}^{N}$ the training set, where $a_i \in \mathbb{R}^n$ is the $i$-th feature vector and $b_i \in \{0,1\}$ is the associated label, we address the following nonconvex optimization problem:
$$\min_{x\in\mathbb{R}^n} f_N(x) = \frac{1}{N}\sum_{i=1}^{N}\left(b_i - \frac{1}{1 + e^{-a_i^T x}}\right)^2. \tag{42}$$
Note that (42) can be framed as Problem (1) by setting $\phi_i(x) = \left(b_i - 1/(1 + e^{-a_i^T x})\right)^2$, $i = 1,\dots,N$, namely the composition of the least-squares loss with the sigmoid function. Furthermore, it is easy to see that the objective function $f_N$ satisfies Assumption 2, since each $\phi_i$ is continuously differentiable and $f_N$ is bounded from below and above.
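For concreteness, the individual loss $\phi_i$ in (42) and its gradient can be coded in a few lines; the following Python sketch is illustrative and not taken from the authors' MATLAB software.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def phi(a_i, b_i, x):
    """Individual loss in (42): squared error of the sigmoid prediction."""
    return (b_i - sigmoid(a_i @ x)) ** 2

def grad_phi(a_i, b_i, x):
    """Gradient of phi by the chain rule; s*(1 - s) is the sigmoid derivative."""
    s = sigmoid(a_i @ x)
    return -2.0 * (b_i - s) * s * (1.0 - s) * a_i
```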
In Table 1, we report the four data sets used in our experiments. For each data set, we specify the number $N$ of feature vectors, the number $n$ of components of each feature vector, and the size $N_T$ of the testing set $I_{N_T}$.
We implement two different versions of Algorithm 1, which differ from one another in the way the two sample sizes N k + 1 t and N k + 1 , g for the estimators in (4) and (5) are selected.
  • SIRTR_nop: this is Algorithm 1 implemented as in [11]. In particular, the infeasibility measure $h$ and the initial penalty parameter $\theta_0$ are chosen as
$$h(M) = \frac{N-M}{N}, \quad \theta_0 = 0.9.$$
In Step 1, the reference sample size $\tilde{N}_{k+1}$ is computed as
$$\tilde{N}_{k+1} = \min\{N,\; \tilde{c}\,N_k\}, \tag{43}$$
where $\tilde{c} = 1.05$. It is easy to see that Rule (43) complies with Condition (6) by setting $r = (N - (\tilde{c}-1))/N$. In Step 2, the trial sample size $N^t_{k+1}$ is chosen in compliance with Condition (7) as
$$N^t_{k+1} = \begin{cases} \tilde{N}_{k+1} - \mu N \delta_k^2, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 \in [N_0,\, 0.95N], \\ \tilde{N}_{k+1}, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 < N_0, \\ N, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 > 0.95N. \end{cases} \tag{44}$$
In Step 3, the sample size $N_{k+1,g}$ is fixed as
$$N_{k+1,g} = c\,N^t_{k+1}, \tag{45}$$
where $c = 0.1$ (these rules are sketched in code after this list). Furthermore, the set $I_{N^t_{k+1}}$ for computing $f_{N^t_{k+1}}(x_k)$ and $f_{N^t_{k+1}}(x_k + p_k)$ is sampled uniformly at random using the MATLAB command randsample, whereas $g_k \in \mathbb{R}^n$ is the sample average approximation in (5) with $I_{N_{k+1,g}} \subseteq I_{N^t_{k+1}}$. The other parameters are set as $x_0 = (0,0,\dots,0)^T$, $\delta_0 = 1$, $\delta_{\max} = 1$, $\gamma = 2$, $\eta_1 = 10^{-1}$, $\eta_2 = 10^{-6}$, $\mu = 100/N$. Note that Choices (44)–(45) for the sample sizes $N^t_{k+1}$, $N_{k+1,g}$ are not sufficient to guarantee that Assumption 4 holds, so Theorems 2 and 3 do not apply to this version of Algorithm 1.
  • SIRTR_p: this implementation of Algorithm 1 differs from the previous one only in the choice of the sample sizes $N^t_{k+1}$, $N_{k+1,g}$. In this case, we force these two parameters to comply with Assumption 4. According to ([29], Theorem 7.2, Table 7.1), a subsampled gradient estimator $\nabla f_S(x_k) = \frac{1}{S}\sum_{i\in I_S} \nabla\phi_i(x_k)$ with sample size $|I_S| = S$ satisfies the probabilistic requirement
$$P\left(\|\nabla f_S(X_k) - \nabla f_N(X_k)\| \le \nu\,\delta_k \mid \mathcal{F}_{k-1}\right) \ge \pi, \tag{46}$$
with $\pi \in (0,1)$, if the sample size $S$ complies with the following lower bound:
$$S \ge N_{k+1}^{\chi,\nu,\pi} = \min\left\{N,\; \frac{4\chi}{\nu\,\delta_k}\left(\frac{2\chi}{\nu\,\delta_k} + \frac{1}{3}\right)\log\left(\frac{n+1}{1-\pi}\right)\right\}, \tag{47}$$
where $\chi = \frac{1}{5}\max_{i=1,\dots,N}\|a_i\|$. Based on the previous remark, we choose the sample sizes of SIRTR_p as follows:
$$N^t_{k+1} = \begin{cases} \max\{N_{k+1}^{\chi,\nu,\pi},\, \tilde{N}_{k+1} - \mu N \delta_k^2\}, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 \in [N_0,\, 0.95N], \\ \max\{N_{k+1}^{\chi,\nu,\pi},\, \tilde{N}_{k+1}\}, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 < N_0, \\ N, & \text{if } \tilde{N}_{k+1} - \mu N \delta_k^2 > 0.95N, \end{cases} \tag{48}$$
$$N_{k+1,g} = \max\left\{N_{k+1}^{\chi,\nu,\pi},\; c\,N^t_{k+1}\right\}. \tag{49}$$
Setting $\pi > 1/2$, choosing $N^t_{k+1}$, $N_{k+1,g}$ as in (48) and (49), and sampling $I_{N^t_{k+1}}$ and $I_{N_{k+1,g}}$ uniformly and independently at random from $\{1,\dots,N\}$, we guarantee that Assumption 4 holds with $\pi_1 = \pi_2 = \pi$, thus ensuring the convergence in probability of SIRTR_p according to Theorems 2 and 3. For our tests, we set $\pi = 3/4$ and $\nu = 10\,|f_N(x_0) - f_{N_0}(x_0)|$.
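As anticipated in the two items above, the variants differ only in their sample-size rules. The following Python sketch implements Rules (43)–(45) and their probabilistic counterparts (47)–(49) under the stated parameter choices ($\tilde{c} = 1.05$, $c = 0.1$); rounding to integer sample sizes is added where the text leaves it implicit, and all function names are ours.

```python
import numpy as np

def reference_size(N, N_k, c_tilde=1.05):
    """Reference sample size, Rule (43)."""
    return min(N, int(np.ceil(c_tilde * N_k)))

def probabilistic_bound(N, n, chi, nu, delta_k, pi=0.75):
    """Lower bound N_{k+1}^{chi,nu,pi} of (47)."""
    t = chi / (nu * delta_k)
    return min(N, int(np.ceil(4.0 * t * (2.0 * t + 1.0 / 3.0)
                              * np.log((n + 1) / (1.0 - pi)))))

def trial_sizes(N, N0, N_ref, mu, delta_k, c=0.1, N_prob=0):
    """Rules (44)-(45) when N_prob = 0 (SIRTR_nop); Rules (48)-(49) otherwise."""
    cand = N_ref - mu * N * delta_k ** 2
    if cand > 0.95 * N:
        Nt = N                                  # third branch of (44)/(48)
    elif cand < N0:
        Nt = max(N_prob, N_ref)                 # second branch
    else:
        Nt = max(N_prob, int(np.ceil(cand)))    # first branch
    Ng = max(N_prob, int(np.ceil(c * Nt)))      # Rule (45) / (49)
    return Nt, Ng
```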
For each data set, we perform 10 runs of both SIRTR_nop and SIRTR_p and assess their performance by measuring the following metrics (a sketch of their computation follows this list):
  • training loss, given as $f_N(x_k)$ with $f_N$ defined in (42);
  • testing loss, defined as
$$f_{N_T}(x_k) = \frac{1}{N_T}\sum_{i\in I_{N_T}} \phi_i(x_k),$$
where $I_{N_T}$ denotes the testing set and $N_T$ its size;
  • classification error, defined as
$$e_k = \frac{1}{N_T}\sum_{i\in I_{N_T}} \left|b_i - b_i^{pred}(x_k)\right|,$$
where $b_i$ denotes the true label of the $i$-th feature vector of the testing set and $b_i^{pred}(x_k) = \max\{\mathrm{sign}(a_i^T x_k), 0\}$ is the corresponding predicted label at iteration $k$.
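A possible vectorized computation of the three metrics, assuming the training and testing features and labels are stacked into arrays (A_train, b_train) and (A_test, b_test), is sketched below.

```python
import numpy as np

def metrics(x_k, A_train, b_train, A_test, b_test):
    """Training loss (42), testing loss, and classification error defined above."""
    s_train = 1.0 / (1.0 + np.exp(-A_train @ x_k))
    s_test = 1.0 / (1.0 + np.exp(-A_test @ x_k))
    train_loss = np.mean((b_train - s_train) ** 2)
    test_loss = np.mean((b_test - s_test) ** 2)
    b_pred = np.maximum(np.sign(A_test @ x_k), 0.0)   # predicted labels
    class_err = np.mean(np.abs(b_test - b_pred))
    return train_loss, test_loss, class_err
```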
We note that (42) can be seen as the optimization problem arising from training a neural network with no hidden layers and the sigmoid function as the activation function of the output layer. Then, as in [30,31], we measure the computational cost of evaluating the objective function and its gradient in terms of forward and backward propagations. Namely, we count the number of full function and gradient evaluations by considering the computation of a single function $\phi_i$ as equivalent to $1/N$ forward propagations and the evaluation of a single gradient $\nabla\phi_i$ as equivalent to $2/N$ propagations. Regarding SIRTR_nop, the computational cost per iteration amounts to $(N^t_{k+1} + N_{k+1,g})/N$ propagations, since $I_{N_{k+1,g}} \subseteq I_{N^t_{k+1}}$. On the contrary, the computational cost of SIRTR_p amounts to $N^t_{k+1}/N + 2N_{k+1,g}/N$ propagations, as $I_{N^t_{k+1}}$ and $I_{N_{k+1,g}}$ are sampled independently of one another. For both algorithms, the computational cost per iteration increases as the iterations proceed; indeed, since $\delta_k$ tends to zero as $k$ tends to infinity (Theorem 1), Rules (44) and (48) will eventually select the trial sample size $N^t_{k+1}$ equal to the reference sample size $\tilde{N}_{k+1}$, which increases geometrically. We expect the computational cost to increase faster in SIRTR_p, as this algorithm also requires the gradient sample size $N_{k+1,g}$ to grow due to Conditions (47)–(49). Finally, we note that the computational cost per iteration of both algorithms is higher than that of the standard stochastic gradient algorithm, which is usually $2N_g/N$, with $N_g$ being a prefixed gradient sample size. However, the increasing sample sizes result in more accurate function and gradient approximations, so the higher computational cost likely implies a larger reduction in the training loss $f_N$ per iteration, as seen in previous comparisons of SIRTR with a non-adaptive stochastic approach in [11].
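Under this accounting, the per-iteration costs of the two variants reduce to one line each; a sketch, where Nt and Ng denote the realized sample sizes $N^t_{k+1}$ and $N_{k+1,g}$:

```python
def cost_nop(Nt, Ng, N):
    # I_{N_{k+1,g}} is a subset of I_{N^t_{k+1}}: forward passes are shared
    return (Nt + Ng) / N

def cost_p(Nt, Ng, N):
    # independent sampling: each gradient costs a full forward + backward pass
    return Nt / N + 2 * Ng / N
```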
In Figure 1, we show the decrease in the training loss, testing loss, and classification error (all averaged over the 10 runs) versus the average computational cost for the first 20 propagations. For most data sets, we observe that SIRTR_nop performs comparably to or even better than SIRTR_p. However, for one of the four data sets (mnist), the accuracy of SIRTR_nop deteriorates after the first propagations, whereas SIRTR_p provides a more accurate classification and a quite steady decrease in the average training loss, testing loss, and classification error. This different behavior of the two algorithms can be explained by looking at Figure 2, which shows the increase in the percentage ratio $100\,N_{k+1}/N$ of the sample size $N_{k+1}$ over the data set size $N$ (averaged over the 10 runs) for both algorithms. As we can see, the sample size in SIRTR_p rapidly increases to 60% of the data set size within the first 50 iterations, whereas SIRTR_nop reaches the same percentage only after 150–200 iterations. Overall, we can conclude that the probabilistic requirements of Assumption 4 provide theoretical support for convergence in probability but might be excessively demanding in practice. In fact, the numerical examples show that increasing the sample size more slowly than imposed by the probabilistic requirements of Assumption 4 provides a good trade-off between the computational cost and the classification accuracy.
In Figure 3, we test the sensitivity of SIRTR_p with respect to the initial penalty parameter $\theta_0$ by reporting the average training loss versus the average computational cost obtained with three different values of $\theta_0$. We observe that the performance of the algorithm is not considerably affected by the choice of this parameter, although large oscillations in the average training loss occur on mnist for the smallest value $\theta_0 = 0.1$. As a general comment, small initial values of $\theta_0$ may not be convenient: since the sequence $\{\theta_k\}$ is non-increasing, small values of $\theta_k$ promote a decrease in the infeasibility measure $h$ rather than in the training loss (see the definition of the actual reduction in (13)). Similar considerations can be made for SIRTR_nop in Figure 4.

5. Conclusions

In this paper, we have proved the convergence in probability of a stochastic trust-region method based on the inexact restoration approach (SIRTR) under the assumption that the function and gradient estimates are sufficiently accurate with sufficiently high probability. This result is novel for SIRTR and agrees with other results obtained in the existing literature [16,18,19]. The numerical experiments on binary classification show that the probabilistic requirements improve the numerical stability of the algorithm, ensuring satisfactory accuracy for all data sets. Future work could address the replacement of the probabilistic requirements considered here with alternative strategies for ensuring convergence in probability, such as variance reduction techniques, or the development of a second-order version of SIRTR for improving accuracy based on approximations of the Hessian obtained through subsampling.

Author Contributions

Conceptualization, S.B., B.M. and S.R.; methodology, S.B., B.M. and S.R.; software, S.B., B.M. and S.R.; validation, S.B., B.M. and S.R.; writing—original draft preparation, S.B., B.M. and S.R.; writing—review and editing, S.B., B.M. and S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been partially supported by the INdAM GNCS project “Ottimizzazione adattiva per il machine learning” (CUP_E55F22000270001) and the Mobility Project “Second order methods for optimization problems in Machine Learning” (ID: RS19MO05) within the frame of the executive Program of Cooperation in the Field of Science and Technology between the Italian Republic and the Republic of Serbia 2019–2022. The third author also acknowledges the financial support received from the IEA CNRS project entitled “VaMOS”.

Data Availability Statement

The data sets employed in this paper were accessed on 1 September 2022 at the following repositories: http://www.csie.ntu.edu.tw/~cjlin/libsvm, http://yann.lecun.com/exdb/mnist, and https://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  2. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  3. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311.
  4. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  5. Schmidt, M.; Le Roux, N.; Bach, F. Minimizing Finite Sums with the Stochastic Average Gradient. Math. Program. 2017, 162, 83–112.
  6. Bandeira, A.S.; Scheinberg, K.; Vicente, L.N. Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 2014, 24, 1238–1264.
  7. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Adaptive regularization algorithm for nonconvex optimization using inexact function evaluations and randomly perturbed derivatives. J. Complex. 2022, 68, 101591.
  8. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Trust-region algorithms: Probabilistic complexity and intrinsic noise with applications to subsampling techniques. EURO J. Comput. Optim. 2022, 10, 100043.
  9. Bellavia, S.; Krejić, N.; Krklec Jerinkić, N. Subsampled Inexact Newton methods for minimizing large sums of convex functions. IMA J. Numer. Anal. 2020, 40, 2309–2341.
  10. Bellavia, S.; Krejić, N.; Morini, B. Inexact restoration with subsampled trust-region methods for finite-sum minimization. Comput. Optim. Appl. 2020, 73, 701–736.
  11. Bellavia, S.; Krejić, N.; Morini, B.; Rebegoldi, S. A stochastic first-order trust-region method with inexact restoration for finite-sum minimization. Comput. Optim. Appl. 2022.
  12. Bergou, E.; Gratton, S.; Vicente, L.N. Levenberg–Marquardt Methods Based on Probabilistic Gradient Models and Inexact Subproblem Solution, with Application to Data Assimilation. SIAM/ASA J. Uncertain. Quantif. 2016, 4, 924–951.
  13. Bergou, E.H.; Diouane, Y.; Kunc, V.; Kungurtsev, V.; Royer, C.W. A subsampling line-search method with second-order results. INFORMS J. Optim. 2022, 4, 403–425.
  14. Bollapragada, R.; Byrd, R.; Nocedal, J. Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 2018, 28, 3312–3343.
  15. Bollapragada, R.; Byrd, R.; Nocedal, J. Exact and Inexact Subsampled Newton Methods for Optimization. IMA J. Numer. Anal. 2019, 39, 545–578.
  16. Chen, R.; Menickelly, M.; Scheinberg, K. Stochastic optimization using a trust-region method and random models. Math. Program. 2018, 169, 447–487.
  17. di Serafino, D.; Krejić, N.; Krklec Jerinkić, N.; Viola, M. LSOS: Line-search Second-Order Stochastic optimization methods for nonconvex finite sums. arXiv 2021, arXiv:2007.15966v2.
  18. Larson, J.; Billups, S.C. Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 2016, 64, 619–645.
  19. Paquette, C.; Scheinberg, K. A Stochastic Line Search Method with Expected Complexity Analysis. SIAM J. Optim. 2020, 30, 349–376.
  20. Xu, P.; Roosta-Khorasani, F.; Mahoney, M.W. Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information. Math. Program. 2020, 184, 35–70.
  21. Blanchet, J.; Cartis, C.; Menickelly, M.; Scheinberg, K. Convergence Rate Analysis of a Stochastic Trust Region Method via Submartingales. INFORMS J. Optim. 2019, 1, 92–119.
  22. Cartis, C.; Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 2018, 169, 337–375.
  23. Birgin, E.G.; Krejić, N.; Martínez, J.M. Inexact restoration for derivative-free expensive function minimization and applications. J. Comput. Appl. Math. 2022, 410, 114193.
  24. Bueno, L.F.; Friedlander, A.; Martínez, J.M.; Sobral, F.N.C. Inexact Restoration Method for Derivative-Free Optimization with Smooth Constraints. SIAM J. Optim. 2013, 23, 1189–1213.
  25. Martínez, J.M.; Pilotta, E.A. Inexact restoration algorithms for constrained optimization. J. Optim. Theory Appl. 2000, 104, 135–163.
  26. Lichman, M. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 1 September 2022).
  27. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. The MNIST Database. Available online: http://yann.lecun.com/exdb/mnist (accessed on 1 September 2022).
  28. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. Available online: http://www.csie.ntu.edu.tw/~cjlin/libsvm (accessed on 1 September 2022).
  29. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. SIAM J. Optim. 2019, 29, 2881–2915.
  30. Bellavia, S.; Gurioli, G. Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization 2022, 71, 227–261.
  31. Xu, P.; Roosta-Khorasani, F.; Mahoney, M.W. Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study. In Proceedings of the 2020 SIAM International Conference on Data Mining, Cincinnati, OH, USA, 7–9 May 2020.
Figure 1. From top to bottom row: data sets a8a, a9a, mnist, phishing. From left to right: average training loss, testing loss, and classification error versus average computational cost.
Figure 2. Average percentage ratio of the sample size $N_{k+1}$ over the data set size $N$ versus iterations. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Figure 3. Average training loss versus average computational cost of SIRTR_p equipped with different values for the initial penalty parameter $\theta_0$. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Figure 4. Average training loss versus average computational cost of SIRTR_nop equipped with different values for the initial penalty parameter $\theta_0$. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Table 1. Data sets used.

| Data Set | $N$ (Training Set) | $n$ | $N_T$ (Testing Set) |
| --- | --- | --- | --- |
| a8a [26] | 15,887 | 123 | 6809 |
| a9a [26] | 22,793 | 123 | 9768 |
| mnist [27] | 60,000 | 784 | 10,000 |
| phishing [28] | 7739 | 68 | 3316 |