1. Introduction
The original Nelder–Mead (NM) simplex algorithm was constructed in 1965 by Nelder and Mead [
1] as an improved version of the simplex method of Spendley, Hext and Himsworth [
2]. During the almost six decades that have passed, the Nelder–Mead algorithm has became a truly popular and widely used method for the minimization problem
in derivative-free optimization (see, e.g., [
3,
4,
5,
6,
7,
8,
9,
10,
11]).
In spite of its widespread use, only a few convergence results are known. They are of essentially two types. The first type of results are related to the convergence of function values at the vertices of the simplex sequence. They say nothing about the vertices. The most famous results of this type are due to Lagarias, Reeds, Wright and Wright [
12]. The other type of results are related to the convergence of simplex vertices or simplices. The first result of this type is due to Kelley [
13], who proved that under a certain (Armijo-type) descent condition, the accumulation points of the simplex sequence are critical points of
f. Recent results on the convergence of the simplex sequence to a common limit point are given in papers [
14,
15,
16].
Wright [
17,
18] raised several questions concerning the Nelder–Mead method:
- (a)
Do the function values at all vertices necessarily converge to the same value?
- (b)
Do all vertices of the simplices converge to the same point?
- (c)
Why is it sometimes so effective (compared to other direct search methods) in obtaining a rapid improvement in f?
- (d)
One failure mode is known (McKinnon [
19])—but are there other failure modes?
- (e)
Why, despite its apparent simplicity, should the Nelder–Mead method be difficult to analyze mathematically?
In fact, there are several recent examples that indicate different types of convergence behavior:
The function values at the simplex vertices may converge to a common limit value, while the function
f has no finite minimum and the simplex sequence is unbounded (Example 5 of [
14]; Examples 1 and 2 of [
15]; Example 4 of [
20]).
The simplex vertices may converge to a common limit point, but it is not a stationary point of
f (McKinnon [
19] and Example 3 of [
15]).
The simplex sequence may converge to a limit simplex of a positive diameter, resulting in different limit values of
f at the vertices at the limit simplex (Theorem 1 of [
21]; Example 4 of [
14]; Examples 4 and 5 of [
15]; Examples 1, 2 and 3 of [
20]).
The function values at the simplex vertices may converge to a common value, while the simplex sequence converges to a limit simplex with a positive diameter (Example of 5 [
20]).
These examples are answering questions (a), (b) and (d) of Wright negatively.
There are several variants of the original Nelder–Mead method [
1]. Here, we study two main versions of the Nelder–Mead algorithm: the ’original’ (unordered) method of Nelder and Mead [
1] (Algorithm 1) and the ’ordered’ version of Lagarias, Reeds, Wright and Wright [
12] (Algorithm 2). In
Section 2, we introduce both algorithms, while
Section 3 contains their matrix representations. In
Section 4, we study the connection and spectral properties of these matrices. In
Section 5, we prove convergence of the function values at the simplex vertices for both algorithms. These results are more general than those of Lagarias et al. [
12], while their proof is quite simple.
In
Section 6, we study the convergence of the generated simplex sequence. Here, we give necessary and sufficient conditions for the convergence of the simplex sequence (for both algorithms). This relates the convergence to the convergence of an infinite matrix product. We pay particular attention to the case when the simplex sequence converges to a common limit point. The results show why the convergence problem (questions (c) and (e) of Wright) is difficult. This paper closes with computational results, conclusions and open questions.
4. Connections and Properties
Here, we investigate the connection between the transformation matrices of both algorithms, their consequences and the spectral properties of the matrices.
The corresponding transformation matrices of Algorithms 1 and 2 are connected by orthogonal similarities. For any
,
with
and
with
.
With
and
, denote the
kth simplices of Algorithms 1 and 2, respectively. If a permutation matrix
exists such that
,
and
, then we can ask if
and
define the same simplex. The assumption implies that
and
. Hence,
and
Since
,
and
, we have
In the case of shrinking,
Note that
. Hence,
Since
, we have that
Hence, the simplices of iteration
are the same for the ordering of vertices. Since the continuation of the original Nelder–Mead method is not uniquely determined, unlike in the case of Algorithm 2, we can expect similar although not identical results for both algorithms.
The following results are obvious. For any
,
Lemma 1. The eigenvalues of are () and (). has the eigenvalues and () for . Both and have a diagonal Jordan form.
Corollary 1. Sequences and are uniformly bounded for and convergent for (see, e.g., [23,24,25]). Note that
and
, where
and
The first expression follows from a result on the singular values of companion matrices (see, e.g., Barnett [
26] or Kittaneh [
27]), while the second follows from a result of Montano, Salas and Soto [
28].
Matrices
,
,
,
,
(
) are left-stochastic. Hence,
We note that
(
) is an involution. If
is multiplied by a permutation matrix
(
), this property is lost. However, for
,
is a 6-involutory matrix, which is exploited in proving Theorem 3 (for
k-involutory matrices, see Trench [
29]).
The multiplication of
or
by a permutation matrix
or
P makes a significant change in the eigenstructure. The following results hold (see Theorem 3 and Corollary 1 of Galántai [
15]).
Theorem 1. - (i)
has at least one eigenvalue .
- (ii)
() has at least two eigenvalues .
- (iii)
If and , then has exactly eigenvalues .
- (iv)
For , has exactly one eigenvalue , while the remaining n eigenvalues are in the open unit disk.
- (v)
If and , then has exactly eigenvalues , while eigenvalues are in the open unit disk.
- (vi)
If and , then all the eigenvalues of are on the unit circle.
- (vii)
If and , then has a second eigenvalue in the interval and all its eigenvalues are in the annulus .
Corollary 2. The eigenvalues of () are and for .
The change in the eigenstructure of
is the most striking in contrast to that of
. It gives better convergence properties for Algorithm 2, as shown in
Section 6 and
Section 7.
5. Convergence of the Function Values at the Vertices
Lagarias et al. [
12] studied the convergence of function values at the simplex vertices for Algorithm 2. We cite their two most important results, before proving more general results for both Algorithms.
Lemma 2 (Lemma 3.3 of [
12])
. If function f is bounded from below on and only a finite number of shrink iterations occur, then each sequence (generated by Algorithm 2) converges to some limit for and . Lemma 2 guarantees the convergence under rather general conditions. In general, the limit values
can be different as shown by Examples 1 and 2 of [
20] (also see [
15,
21]).
Theorem 2 (Theorem 5.1 of [
12])
. Assume that f is a strictly convex function on with bounded level sets. Assume that is non-degenerate. Then, the Nelder–Mead simplex method (Algorithm 2) converges in the sense that . Theorem 2 is generally considered as the main result of [
12]. In addition, it can be proved (see [
20]) that if
f is strictly convex on
with bounded level sets,
is non-degenerate (affine invariant); thus, any accumulation point
of the simplex sequence
(generated by Algorithm 2) has the form
.
We need the following concept and results.
Definition 1. Let f be defined on a convex set . The function f is said to be quasiconvex on S iffor every and for every . Convexity implies quasiconvexity.
Lemma 3 (Lemma 6 of [
20])
. If f is quasiconvex and bounded below on , then each sequence is monotone decreasing and converges to some limit for . Furthermore, . Corollary 3 (Corollary 2 of [
20])
. If f is convex on and an index exists such that , then and no shrinking occurs in iteration k. Lemma 4 (Lemma 10 of [
20])
. Assume that f is convex and bounded below on . If for some integer ℓ (),then there is an index such that for all , , that is, the first ℓ vertices do not change. Lemma 5 (Lemma 11 of [
20])
. If f is convex and bounded below on , then . The following generalization of Theorem 2 of Lagarias et al. was proved in [
20]. Here, we present it with a new short proof.
Theorem 3. Assume that f is a convex function on and is bounded below. Then, the Nelder–Mead simplex method (Algorithm 2) converges in the sense that .
Proof. It follows (Lemmas 3 and 5) that . Assume that . It follows from Lemma 4 that there exists an index such that for , . Hence, only and may change. The insertion rule and the impossibility of shrinking (Corollary 3) imply that only the following cases are possible:
- (i)
;
- (ii)
and ;
- (iii)
and .
We assume that
is big enough, so that for
,
,
(
), where
is such that
. Depending on the selected case,
,
or
must hold. In case (i),
replaces
and
. In case (ii), since
, we have
Hence, case (ii) cannot occur. Similarly, in case (iii),
showing that case (iii) cannot occur. Hence, we can only have the simplex sequence
Since is a 6-involutory matrix (), is a periodic sequence and , which is impossible by the insertion rule. This is a contradiction; thus, . □
For Algorithm 1, we prove the following results. First, observe the following facts. For nonshrink steps, we have
- (i)
for , ;
- (ii)
() and ;
- (iii)
;
- (iv)
and .
Hence, ().
Lemma 6 (Algorithm 1). If is bounded below and only a finite number of shrink operations occur, then there exist finite limit values such that ().
The only difference to Lemma 2 is that the function values and their limits are not ordered.
Lemma 7 (Algorithm 1). If is quasiconvex on and bounded below, then there exist finite limit values such that ().
Proof. If
f is quasiconvex, then in the case of shrinking, the inequality
also holds. So does Lemma 6. □
Lemma 8 (Algorithm 1). If is strictly convex, then no shrinking occurs.
Proof. Shrinking happens if or holds. In general, . If , then . If , then . □
Remark 1 (Algorithm 1). Assume that is convex. Then, . If , then . If , then . However, if at least for one , holds, then and . In such a case, there is no shrinking.
Assume that
is bounded below and only a finite number of shrink operations occur. Then, we have finite numbers
such that
as
, and each
is monotone decreasing (
). The values
are unordered. However, there is permutation
of the indices
such that
Lemma 9 (Algorithm 1). If f is convex and bounded below on , then .
Proof. Assume that
. Lemma 6 and Remark 1 imply that for
, there exists an index
such that for
,
Note that
(
). Hence, there cannot be shrinking, and only the worst vertex
can change for
(
). Clearly,
for
. If
, then
by definition. Hence, only the cases
are possible. For simplicity, let
. Then,
and
Since
(
), it follows that
and
Since
f is convex, we have
which is a contradiction. Hence,
holds under convexity. □
Theorem 4. If f is convex and bounded below on , then Algorithm 1 converges in the sense that .
Proof. Lemmas 6 and 9 imply that for some permutation of . Assume that . Then, there exists an index K such that for , () holds with . Since , only vertices and can change for . Also, note that implies that no shrinking may occur. It also follows that only are possible. Consider now the following ’possible’ cases:
- (i)
;
- (ii)
and ;
- (iii)
and .
In the first case,
and
. Hence, the operation
must be followed by
(
) for some
. In the second case,
which makes this particular step impossible. Similarly, in the third case,
this step is impossible. Hence, the only possible step is
with
and
. In fact, we have
for a pair of
and
, and this is repeated. Consider the following possible cases:
It is easy to check that for every
,
Here, we have a periodicity,
, which is impossible by the condition
. Hence,
must hold. □
Theorems 3 and 4 are generalizations of Theorem 2 of Lagarias et al. [
12], since they require only convexity and boundedness from below. Moreover, their proofs are much simpler.
6. Convergence of the Simplex Sequences
Here, we study the convergence of the simplex sequence
generated by both Nelder–Mead methods. If
, then (provided that
f is continuous)
for
as
. The limit vertices
can be different (see, e.g., Examples 1 and 2 of [
20]).
Ideally,
should converge to some limit of the form
, where
is a stationary point of
. MacKinnon’s example [
19] and Example 3 of [
15] show that
is not always a stationary point.
If
converges to a rank one matrix of the form
(
), then
and diameter
. There are several examples (see, e.g., [
14,
15,
20,
21]), where the simplex sequence
converges to a limit
with diameter
and rank (
, rank
.
The following necessary and sufficient result holds for both algorithms.
Lemma 10. Assume that is non-degenerate (affinely independent). Then, (generated by Algorithm 1 or 2) converges to some if and only if converges to some .
Proof. In both cases
, where
(in case of Algorithm 1) or
(in case of Algorithm 2). If
, then
. Since
, we have
By assumption, the first matrix on the right is invertible and so
If
, then
□
From now on, we assume that is always non-degenerate (its columns are affinely independent). Lemma 10 is formally independent of f. However, the decision mechanism of the NM algorithm determines the next simplex and the resulting matrix product. Certain configurations clearly cannot occur. Such is the case of strictly convex functions, when no shrinking can occur or cannot be repeated in a sequence more than five times (for the case ). In general, it is hard to tell what sequences (products) are possible for a given f.
Lemma 11. If (, or ), then rank .
Proof. Assume that rank
. Then,
is invertible. Since all of
is invertible,
. As
,
Hence,
is necessary for an invertible
. Since
or
is a finite set and its elements are different from
I, this cannot happen and rank
. □
Sylvester’s theorem on rank (see, e.g., Mirsky [
30]) implies that
Since, by assumption, rank
, we have
rank
rank
. In Example 4 of [
15], rank
, rank
and rank
.
The following simple result characterizes the possible limits of infinite matrix products.
Lemma 12. Assume that () and occurs infinitely often in the product , then every nonzero row of is a left eigenvector of belonging to the eigenvalue .
Proof. Since
appears infinitely many times in the product
, there is a subsequence of
with rightmost factor
, say
where the
s are products of
s. Since
, we have
. □
Define the (left) 1-eigenspace of matrix A by . The rows of belong to . If several , say (), occur infinitely often in the product , then the rows of belong to .
Since for all and for all , we have and , respectively. If or , then has the form for some .
Note that for any or , . Hence, if , then , that is, . If , then implies that .
We recall the following definitions and results (see, e.g., Hartfiel [
31,
32]).
A right infinite matrix product is an expression . A set of matrices has the right convergence property (RCP), if all possible right infinite products () converge.
A set of matrices is product-bounded if there is a constant such that for all k and all .
If is an RCP set, then is also product-bounded.
Lemma 13. If is an RCP set, and λ is an eigenvalue of , then or , and this eigenvalue is simple. Hence, each matrix of must satisfy this condition.
The matrices
and
have at least one eigenvalue greater than 1, and
and
in any induced matrix norm. Hence,
and
are unbounded. There are also examples when the Nelder–Mead algorithm produces unbounded simplex sequences
or
(see, e.g., [
14,
15,
20]). It is clear that the whole matrix set
or
is not an RCP set.
Hence, we must seek for subsets
of
or
, which are RCPs. However, Blondel and Tsitsiklis [
33] proved that the product boundedness of a finite matrix set
is algorithmically undecidable and it remains undecidable even in the special case, when
consists of only two matrices. Since product boundedness is a weaker property than the RCP from which it follows, and although it is algorithmically undecidable, it seems difficult to decide the RCP property in general. This might answer question (e) of Wright.
However, it is relatively easy to identify one particular RCP set. For both algorithms, the corresponding shrinking matrices form an RCP set.
Lemma 14. Both and are RCP sets.
Proof. By using induction on
k, we can prove that for
,
and
For the last formula, note that if
P is a permutation matrix, then
and
for some
j. In element-wise partial ordering,
holds for
and
Here,
as
, and the sequence
is monotone increasing in component-wise partial ordering. Hence,
is convergent. We also have that
,
as
, and
is monotone increasing in partial ordering. Hence,
is convergent. □
Remark 2. Corollary 2 implies that for all . If matrices () occur infinitely often in the product and (), then the nonzero rows of belong to . Hence, for some w. Lemma 1 implies that for all . It also follows from the previous argument that for some w.
The for the infinitely repeated shrinking process also follows from Cantor’s intersection theorem.
Lemma 15. Define the matrixThen, for all ( or ), the matrix has the common block lower triangular formwhere and . The result was proved for
in [
14,
15]. The proof of case
is similar. Since both
and
are finite sets, there exist two constants
such that
and
for all
or
. For
, the corresponding
has
eigenvalues of
and
. For
(
),
. However, for
with
and
, the corresponding
s have all their eigenvalues in the open unit disk (see Theorem 1). Hence, they can be elements of an RCP set.
If the infinite matrix product
(
(
) or
(
)) converges to a rank-one matrix of the form
, that is, if
(
,
), then it follows that
Lemma 16. For the convergence of the simplex sequence to a common point, it is necessary and sufficient that bothandhold for some vector . It is clear that the infinite products of or satisfy these conditions. We recall the following simple result.
Lemma 17 ([
14,
15,
16])
. Assume that , is convergent () and that for all k. Then, converges andfor some . If
for all
i, then the speed of convergence is linear (geometric). This follows from
and the inequality
with
.
We recall the following result (see Hartfiel [
31], Corollary 6.4, Daubechies and Lagarias [
34]).
Lemma 18. Let be a compact matrix set. Then, every infinite product, taken from , converges to 0 if there is a norm such that , , for all .
Hence, there must be a norm such that , for all (or ). Note that Lemma 18 is not constructive. However, for the convergence of (for all ), the condition () is necessary and sufficient.
For Algorithm 1, is the only subset that satisfies this requirement. For Algorithm 2, define the set and assume that
- (A)
there exists a matrix norm (induced by a vector norm ) such that if , then .
If (A) holds, then is an RCP set and every infinite product () has the form for some w. It follows from Theorem 1 that for , .
The existence of such norm
of the form
was proved in [
14] for
and in [
15] for
. For the cases
, the corresponding matrices
S, found experimentally by direct search methods, are given in
Appendix A. Since the eigenvalues of
(
,
) that lay inside the unit disk converge to 1 for
(see Lemma 3 of [
15]), it is hard to find a norm with property (A) for greater
n values.
The obtained RCP sets
and
are quite narrow. However, we can keep the convergence of any infinite product from
or
by inserting an infinite number of matrices from
under proper conditions (for Algorithm 2, see [
14,
15]). For Algorithm 2, it was also observed [
16] that several fixed-length matrix products have spectral norms less than 1. Upon these observations, the convergence set of both algorithms can be significantly expanded.
Assume again that
or
. Moreover, introduce the notation
and consider the product
, where
Let
Then,
and
where
and
are uniformly bounded.
Assume that
and consider the expression
which is clearly equal to (
32).
For
, define the sets
For any
, we have the estimates
and
. Hence, by Lemma 17, the set
is an RCP set, and the limits of the infinite products
are of the form
.
Now, consider the product
where
(
),
(
) and
are fixed (
). The corresponding reduced form is
Consider the product
(
), where the number of factors
that belong to
is
. Clearly,
. There exists a
such that
. Assume that there is a number
such that
. Then,
Since
and the conditions of Lemma 17 hold, we proved the following result.
Theorem 5. Assume that , is non-degenerate, is fixed and is not empty. Let be the number of ℓ products that belong to during the first k iterations of the Nelder–Mead method (Algorithm 2). Moreover, assume that for an integer , and for some , . Then, the Nelder–Mead method (Algorithm 2) converges in the sense thatwith a convergence speed proportional to . Note that assumption , which is a density condition, yields an infinite number of factors from the set as . The larger the ratio , the wider the convergence set. Theorem 5 is clearly true for any pair of subsets and such that and .
For Algorithm 1 and some , , to which we cannot add more elements, since in any induced matrix norm. For Algorithm 2, under Assumption (A).
Theorem 5 was first proved for Algorithm 2 for
in [
14,
15] and for
in [
16] in a slightly different way and form. The conditions were verified for a spectral norm and
up to dimension
.
The computational results of
Section 7 show that for
,
and
(see also [
16]). Hence, Theorem 5 indeed extends the convergence sets
and
(quite significantly).
Upon the basis of the earlier version of Theorem 5, a stochastic convergence result was proved for Algorithm 2 in [
16]. Since Theorem 5 holds for both algorithms, this can be extended for Algorithm 1 as well. However, it will be published later for reasons discussed in
Section 7.
7. Computational Results and Conclusions
Here, we consider the cardinality of set for both algorithms with and without shrinking operations.
Define the quantities
and
. The larger the ratio
or
, the wider the convergence set.
Table 1 and
Table 2 contain the computed values
or
, respectively.
Note that
and
. The ratio
cannot achieve 1 (see the examples within
Section 1 and
Section 6). For increasing
n, the
shrinking matrices somehow distort the real situation, since it is unlikely that all possible shrinking steps occur for one particular function.
It is clear that Algorithm 2 has a wider convergence set than Algorithm 1.
If we discard the shrinking operations (the case of strictly convex functions), then Theorem 5 also holds for the pair of sets
and
instead of
and
, respectively. Define again the ratios
and
. Note that for
and some
,
.
Table 3 and
Table 4 contain the computed values of
and
, respectively.
The comparison of the last two tables shows again that Algorithm 1 has a smaller convergence set and a slower convergence rate than Algorithm 2.
We can observe that the larger the ℓ, the larger is the ratio in all four tables. Although the ratios cannot achieve 1, it is conjectured that they might be quite close to 1 for a large ℓ.
For both algorithms, we can observe the increase in convergence sets, if
ℓ is increasing. For larger
n values, this increase is slower, especially if no shrinking is allowed. However, this requires an enormous computational time to check, even for modest pairs of
n and
ℓ. The reason for this behavior is unknown and yet to be investigated. However, we can establish the following conclusions. In case of the convergence of the simplex sequence, Algorithm 2 is better than Algorithm 1. Hence, the ordering of Lagarias et al. [
12] significantly improved the classical Nelder–Mead simplex method.
Theorem 5 indicates that the generated simplex sequence has a strong tendency to converge to one common limit point. This, together with Lemma 2 (or Lemma 3), may guarantee that the Nelder–Mead simplex algorithm generates good practical results.
In the case of the convergence of function values at vertices, the novel results are essentially the same for both methods. The extension of Theorems 2, 3 or 4 for is an open question. For , all proofs exploit (implicitly or explicitly) an involuntory property, which does not hold for . How the convexity assumptions can be relaxed further is also an open question. If f is quasiconvex on and bounded from below, then . However, the other parts of the proof do not work.