1. Introduction
Finding the optimal parameters of a high dimensional function is a common problem in many fields. We seek protein conformations with minimal free energy in biophysics, the genotypes with maximal fitness in evolution, the parameters of maximum likelihood in statistical inference, and optimal design parameters in countless engineering problems. Optimization can be visualized as climbing a landscape representing the objective function along the steepest directions. Often derivatives of the objective function are not available or are too costly, and derivative-free algorithms must be applied.
Evolutionary algorithms aim to find the optimum of an objective “fitness” function $F(\mathbf{x})$, with parameters $\mathbf{x}$ representing genetic or phenotypic variants. Algorithms differ in how new variants are generated and how the set of variants (the population) shifts over the course of the algorithm (selection). A class of algorithms called genetic algorithms is most directly inspired by the Wright–Fisher and Moran models from population genetics [1,2]. Genetic algorithms use mutation and recombination operators to generate new variants, and some form of biased stochastic reproduction to select variants for the next generation. Among the different stochastic selection schemes, fitness-proportionate selection is equivalent to natural selection in population genetics [1].
Stochasticity of reproduction, known as genetic drift in population genetics, is a natural phenomenon present in finite populations and has the important effect of scaling selection inversely with the magnitude of stochasticity (i.e., population size) [2]. Stochastic selection is very inefficient in that it causes the loss of many genotypes. For example, in the regime of strong selection and weak (i.e., rare) mutation of the Moran model, the probability that a single genotype will sweep a population (fixation) is proportional to its selective advantage. For a variant with a fitness advantage of 1%, typical in bacteria, there is a 99% chance it will go extinct [2].
In evolutionary optimization, stochasticity of reproduction may be helpful in exploring noisy or rugged fitness functions, but it is not essential if there are other sources of stochasticity. Some genetic algorithms use deterministic rank-based selection, which removes individuals below some threshold [1]. However, such truncation selection is very coarse, and does not weight the surviving variants in proportion to their fitness.
Another class of evolutionary algorithms, which includes estimation of distribution algorithms [3] and algorithms known as evolution strategies [4,5], is based on defining a population as a parameterized distribution $p_\theta(\mathbf{x})$. These algorithms differ from genetic algorithms in that there are no selection or mutation operators applied directly to individuals in a population. In these algorithms, variants are generated by drawing from the distribution, and selection amounts to updating the parameters $\theta$ in the direction of higher population mean fitness, i.e., the gradient $\nabla_\theta \bar{F}$. To account for the uncertainty of the parameters, some algorithms move the parameters in the direction of the natural gradient, $g^{-1} \nabla_\theta \bar{F}$, where $g^{-1}$ is the inverse of the Fisher information, which can be estimated from the population [6]. The popular covariance matrix adaptation evolution strategy (CMA-ES) [7,8] and the closely related natural evolution strategies (NES) [9,10,11,12] parameterize a population as a normal distribution, and use variants drawn from the distribution to update the mean and covariance with a natural gradient descent step. More generally, natural gradients describe the ascent of the fitness function in terms of information geometry, and their use characterizes a wide class of information-geometric algorithms [13]. Computing the natural gradient in these algorithms often requires matrix factorization or inversion, and explicitly calculating parameters, such as the covariance matrix, can be costly if the dimensionality is high.
Here, we develop an evolutionary algorithm, similar to CMA-ES/NES, that uses deterministic natural selection to incorporate the natural gradient, without needing to store a covariance matrix or perform any costly linear algebra operations. We show that the natural gradient used in information-geometric optimization is also a fundamental part of natural selection. We show how natural selection is related to Newton’s method in optimization on a quadratic fitness landscape. We describe how the choice of the scale of selection must balance biasing towards high fitness against losing information about the underlying fitness function. We show how to generate new variants with a novel recombination operator, without having to explicitly estimate a covariance matrix. Finally, we develop a proof-of-principle quantitative genetic algorithm (QGA) which iterates between generating variants with our recombination operator and a method of tuning the scale of selection to converge to an optimum.
2. Natural Selection Gradient
We begin by considering a population of infinite size and with a finite number of unique variants called phenotypes. A variant $i$ is defined by a continuous multivariate phenotype $\mathbf{x}_i$ (a vector), a frequency $p_i$, and a growth rate, or fitness, $F_i = F(\mathbf{x}_i)$. Note that a variant does not represent an individual organism, but a number of organisms with the same phenotype. In the context of optimization, phenotypes are candidate solutions, and fitness is the objective function to be maximized. Classical replicator dynamics, leaving out mutation and genetic drift, describes the change in frequencies due to selection as

$$\dot{p}_i = p_i \left( F_i - \bar{F} \right), \qquad (1)$$

with mean fitness $\bar{F} = \sum_i p_i F_i$. In stochastic descriptions, these dynamics describe the expected change due to selection.
In the absence of other processes, frequencies can be integrated over time, resulting in

$$p_i(t) = \frac{p_i(0)\, e^{F_i t}}{Z(t)}, \qquad (2)$$

with normalization $Z(t) = \sum_j p_j(0)\, e^{F_j t}$. At long time $t$, the phenotype distribution will concentrate on high-fitness phenotypes until the highest-fitness phenotype reaches a frequency of unity. The change in mean fitness equals the fitness variance (Fisher’s theorem [14]), and higher moments evolve as well. As an example, we show 100 variants in a quadratic fitness landscape and how their frequencies change over time (Figure 1).
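The integrated dynamics of Equation (2) are straightforward to evaluate numerically. The following is a minimal Python/NumPy sketch (the 100-variant quadratic landscape is illustrative, not the exact setup of Figure 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population: 100 variants on a 2-D quadratic fitness landscape.
x = rng.normal(size=(100, 2))               # phenotypes
F = -0.5 * np.sum(x**2, axis=1)             # fitness F(x) = -|x|^2 / 2

def frequencies(F, t, p0=None):
    """Integrated replicator dynamics, Equation (2): p_i(t) ∝ p_i(0) exp(F_i t)."""
    p0 = np.full(len(F), 1.0 / len(F)) if p0 is None else p0
    w = p0 * np.exp(t * (F - F.max()))      # shift by max(F) for numerical stability
    return w / w.sum()

# Mean fitness rises monotonically with t, per Fisher's theorem.
for t in (0.0, 1.0, 10.0, 100.0):
    p = frequencies(F, t)
    print(f"t = {t:6.1f}   mean fitness = {p @ F:8.4f}")
```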
Remarkably, replicator dynamics can be rewritten in terms of information geometry [15]. The distribution over phenotypes can be described as a discrete categorical distribution, with parameters $\mathbf{p}$, the vector of (linearly independent) frequencies. Natural selection adjusts these parameters in the direction of the covariant derivative [16,17], also known as the natural gradient [6],

$$\dot{\mathbf{p}} = g^{-1}(\mathbf{p})\, \nabla_{\mathbf{p}} \bar{F}, \qquad (3)$$

where $\nabla_{\mathbf{p}} \bar{F}$ is the vector of partial derivatives of the mean fitness, and $g^{-1}(\mathbf{p})$ is the inverse of the Fisher information metric of the categorical distribution. This metric defines distances on the curved manifold of probability distributions (see Appendix A). Selection changes the frequencies in the direction of steepest ascent in non-Euclidean coordinates defined by the geometry of the manifold of the distribution.
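This equivalence can be checked numerically. A minimal sketch, using the standard Fisher metric of the categorical distribution over the $n-1$ independent frequencies, $g_{jk} = \delta_{jk}/p_j + 1/p_n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
p = rng.dirichlet(np.ones(n))        # frequencies of n variants
F = rng.normal(size=n)               # their fitnesses

# Replicator dynamics, Equation (1): dp_i/dt = p_i (F_i - mean fitness).
replicator = p * (F - p @ F)

# Natural gradient over the n-1 independent frequencies p_1..p_{n-1}:
# Fisher metric g_jk = δ_jk / p_j + 1 / p_n, and ∂F̄/∂p_j = F_j − F_n
# (since p_n = 1 − Σ_{j<n} p_j).
g = np.diag(1.0 / p[:-1]) + 1.0 / p[-1]
grad = F[:-1] - F[-1]
natural_gradient = np.linalg.solve(g, grad)   # g⁻¹ ∇F̄ without forming g⁻¹

print(np.allclose(natural_gradient, replicator[:-1]))   # True
```

The replicator step and the natural gradient step coincide exactly, with no tuning.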
The natural gradient is a geometric quantity independent of parameterization. Therefore, if the distribution over $\mathbf{x}$ can be well approximated by another distribution, selection will also change those parameters in the direction of the natural gradient. This can be demonstrated by projecting onto a normal phenotype distribution, as is assumed in classic multivariate quantitative genetics. The population mean $\boldsymbol{\mu}$ and population covariance matrix $\Sigma$ parameterize the distribution, and selection changes the mean as ([18,19], Appendix A)

$$\dot{\boldsymbol{\mu}} = g^{-1}(\boldsymbol{\mu})\, \nabla_{\boldsymbol{\mu}} \bar{F} = \Sigma\, \nabla_{\boldsymbol{\mu}} \bar{F}, \qquad (4)$$

where $g(\boldsymbol{\mu}) = \Sigma^{-1}$ is the associated Fisher information metric. Similarly, the covariance follows a natural gradient with a more complex metric (Appendix A). If the phenotype covariance reaches zero, then the population is monomorphic and there is zero selection. However, a related population genetics model in the limit of small diversity can search a fitness landscape with the help of mutations, with a covariance matrix defined by mutations serving as a metric (Appendix E).
For a finite amount of time, the frequencies have the Boltzmann form of Equation (2), and the parameters trace a path on the manifold of distributions following the natural gradient. Exponential weights lead to natural gradient updates and are found in many optimization algorithms beyond genetic algorithms, such as simulated annealing and population annealing [20,21]. In contrast, CMA-ES/NES use a population to update the parameters of a normal distribution in a natural gradient step, and do not track frequencies.
3. Natural Selection Uses Second Derivatives
Natural gradient descent uses second derivatives, and is equivalent to Newton’s method on a quadratic objective function [6]. Analogously, natural selection on a population of finite variants is related to Newton’s method on a quadratic fitness landscape.
Phenotypes remain normally distributed under replicator dynamics on a quadratic fitness landscape, which can be seen by using Gaussian frequencies in Equation (2). The mean and covariance change as (Appendix B)

$$\boldsymbol{\mu}_t = \boldsymbol{\mu}_0 + t\, \Sigma_t \nabla F, \qquad \Sigma_t = \left( \Sigma_0^{-1} + t A \right)^{-1}, \qquad (5)$$

where subscripts indicate dependence on time; $\nabla F$ is the fitness gradient, evaluated at $\boldsymbol{\mu}_0$; and $A$ is the fitness curvature, that is, the negative of the matrix of second derivatives of fitness, evaluated at $\boldsymbol{\mu}_0$. The change in mean is a finite natural gradient step, while the covariance aligns itself with the curvature over time. Combining the two equations yields

$$\boldsymbol{\mu}_t = \boldsymbol{\mu}_0 + \left( \Sigma_0^{-1} + t A \right)^{-1} t\, \nabla F.$$

In the asymptotic regime of large time $t$, this approaches $\boldsymbol{\mu}_t \to \boldsymbol{\mu}_0 + A^{-1} \nabla F$, corresponding to an iteration of Newton’s method. However, at large time $t$ the distribution is no longer normally distributed due to the finite number of variants (see below). For small and intermediate time $t$, selection behaves as a form of regularized Newton’s method, where $t$ determines how much to weigh the initial distribution. Since $\Sigma_t$ is always positive-definite, the population should not converge to a saddle point in fitness, which is possible in Newton’s method.
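The interpolation between the initial distribution and the Newton step can be seen numerically. A minimal sketch with a randomly generated positive-definite curvature $A$ and gradient $\nabla F$ (illustrative assumptions, not a specific landscape from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 3

# Illustrative quadratic landscape: curvature A (symmetric positive-definite)
# and fitness gradient grad_F, both evaluated at the initial mean.
M = rng.normal(size=(D, D))
A = M @ M.T + D * np.eye(D)
grad_F = rng.normal(size=D)

mu0 = np.zeros(D)
Sigma0_inv = np.eye(D)

for t in (0.1, 1.0, 10.0, 1000.0):
    Sigma_t = np.linalg.inv(Sigma0_inv + t * A)   # covariance aligns with curvature
    mu_t = mu0 + Sigma_t @ (t * grad_F)           # finite natural-gradient step
    print(f"t = {t:7.1f}   mu_t = {mu_t}")

# Large-t limit: one iteration of Newton's method.
print("Newton step:", mu0 + np.linalg.solve(A, grad_F))
```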
4. Selection Reduces Diversity
Natural selection moves the distribution along a manifold shaped by the fitness landscape. However, selection does not introduce any new phenotypes, and it reduces diversity by biasing frequencies towards high-fitness phenotypes. Diversity can be quantified by the population entropy $S = -\sum_i p_i \ln p_i$, which summarizes the variation in the distribution of frequencies. The exponential of entropy, $e^S$, defines an effective number of variants, such that $1 \le e^S \le N$, and $e^S = N$ is the number of unique variants under uniform initial conditions $p_i(0) = 1/N$. Diversity shrinks rapidly at large time $t$, depending on the starting point $p_i(0)$ (Figure 2A).
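Entropy and the effective number of variants are computed directly from the frequencies. A minimal sketch:

```python
import numpy as np

def entropy(p):
    """Population entropy S = -Σ p_i ln p_i (in nats)."""
    p = p[p > 0]                     # 0 ln 0 = 0 by convention
    return -np.sum(p * np.log(p))

# Under uniform frequencies, exp(S) equals the number of unique variants.
print(np.exp(entropy(np.full(100, 0.01))))     # 100.0

# Selection concentrates the frequencies and shrinks the effective diversity.
F = np.random.default_rng(3).normal(size=100)
for t in (0.0, 1.0, 5.0, 25.0):
    w = np.exp(t * (F - F.max()))
    print(f"t = {t:5.1f}   exp(S) = {np.exp(entropy(w / w.sum())):7.2f}")
```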
Furthermore, for large time $t$ and small diversity, the approximation of normally distributed phenotypes no longer holds, and Equation (5) does not hold even if fitness is purely quadratic. The breakdown of normality is reflected in the moments of the fitness distribution. The gap between the mean and the maximum possible fitness, $F^{\max} - \bar{F}$, known as genetic load, shrinks as time increases, but under normal phenotype distributions it remains $F^{\max} - \bar{F} \approx D/(2t)$, where $D$ is the dimensionality of phenotypes (Appendix C). Fitness has units of $1/t$, as evident from Equations (1) and (2). The unit-less approximate scaled load

$$\hat{L} = \frac{2t}{D} \left( F^{*} - \bar{F} \right),$$

where $F^{*}$ is the maximum observed fitness, is zero when mean fitness approaches the maximum, and reaches a peak around some intermediate level of selection (Figure 2B). Similarly, the fitness variance scaled by $2t^2/D$ has a peak at intermediate selection (Figure 2C), since fitness variance must be zero at high selection.
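These moments are cheap to monitor during a run. A minimal sketch of the scaled load and scaled fitness variance as defined above, on an illustrative quadratic landscape with a finite population:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 5
x = rng.normal(size=(2000, D))
F = -0.5 * np.sum(x**2, axis=1)          # quadratic fitness, maximum near 0

for t in (0.1, 1.0, 10.0, 1000.0):
    w = np.exp(t * (F - F.max()))
    p = w / w.sum()
    F_bar = p @ F
    scaled_load = 2 * t * (F.max() - F_bar) / D      # ≈ 1 while normality holds
    scaled_var = 2 * t**2 * (p @ (F - F_bar)**2) / D
    print(f"t = {t:7.1f}   load = {scaled_load:6.3f}   var = {scaled_var:6.3f}")
```

The peak at intermediate selection (Figure 2B,C) reflects the crossover between the normal regime and the finite-variant regime, where the weights concentrate on a handful of variants.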
We can regard time $t$ as the scale of selection, as it directly multiplies the fitness. There is an inherent tradeoff in choosing the scale of selection: weak selection only weakly biases the frequencies towards high fitness, while strong selection leaves low diversity and a population that is not well approximated by a normal distribution.
5. Generating New Variants
As selection $t$ increases, the population of phenotypes $\mathbf{x}_i$ moves up the natural gradient (Equation (2)). To make an effective evolutionary optimization algorithm, we need to generate new variants based on the existing parent population.
A population with large enough diversity on a quadratic fitness landscape is approximately normally distributed, and the weighted population mean and weighted population covariance matrix can be interpreted as the weighted maximum likelihood estimates of the mean and covariance. Therefore, new variants can be generated by estimating these parameters and drawing from the distribution, as in CMA-ES/NES. In contrast to those algorithms, we do not need to explicitly calculate a natural gradient step, since this is done by tuning the weights of the variants. Instead of pursuing this approach, we draw more inspiration from biology.
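For reference, the explicit-estimation route mentioned above would look like the following sketch (a CMA-ES-like sampling step that we do not pursue; function and variable names are ours):

```python
import numpy as np

def sample_explicit(x, p, rng, n_new=1):
    """Draw new variants by explicitly estimating the weighted normal distribution.
    x: (N, D) phenotypes; p: (N,) Boltzmann weights summing to one."""
    mu = p @ x                                   # weighted ML mean
    dx = x - mu
    Sigma = dx.T @ (p[:, None] * dx)             # weighted ML covariance (D x D)
    return rng.multivariate_normal(mu, Sigma, size=n_new)
```

QGA avoids forming the $D \times D$ covariance matrix altogether, which motivates the recombination operator introduced next.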
6. Recombination Efficiently Generates Diversity
While selection moves the mean phenotype toward an optimum at the cost of reducing diversity, diversity can be restored by mutation and recombination processes that add new variants to the population. Biologically, mutations and recombination are changes in genotypes and their effect on phenotypes depends on the genotype-phenotype map. Mutations generate new variants that, if beneficial, may rise to some frequency after selection, and the expected cost or benefit of a mutation depends on the fitness landscape.
As an alternative to specifying a genotype-phenotype map, we define a mutation as a small stochastic normally distributed change in phenotype space. Whether new variants have higher or lower fitness than the existing population depends on the fitness landscape. Generally, fitness landscapes will have some directions which are very steep and some that are shallow. If a population is on such a high dimensional ridge, where most directions are costly, then a mutation is very likely to be deleterious. For normally distributed populations and mutations, the expected fitness cost is proportional to the effective number of directions which are deleterious (Appendix D).
From an optimization perspective, mutations are very inefficient in that they rarely generate good solutions, since they are poorly aligned with the fitness landscape. However, recombination has the remarkable property of being adaptive to the fitness landscape. In population genetics models with genotypes, recombination is known to quickly break down correlations between sites in a genome until linkage equilibrium is reached [22,23]. Under these conditions, the phenotype covariance and expected fitness of recombined offspring are the same as those of the parent population [22,23]. If the population is on a high dimensional ridge and the population covariance is aligned with it, recombined solutions will also be aligned with the ridge. However, linkage equilibrium can only be reached on fitness landscapes without interactions between genetic sites, which excludes most fitness landscapes, including quadratic fitness-phenotype maps.
For the purposes of optimization, we are free to define any recombination operator. Genetic algorithms typically use recombination in the form of a crossover operator, similar to genetic recombination. However, a naive application of crossover to phenotypes would destroy any covariance in the population. Here, we define a recombination of phenotypes that preserves covariance. A pair of distinct phenotypes $\mathbf{x}_j$ and $\mathbf{x}_k$, chosen by weighted random sampling with weights $p_j$ and $p_k$, can be recombined as

$$\mathbf{x}' = \boldsymbol{\mu} + \frac{1}{\sqrt{2}} \left( \mathbf{x}_j - \mathbf{x}_k \right), \qquad (6)$$

which has the same expectation $\boldsymbol{\mu}$ (the weighted population mean) and covariance $\Sigma$ as the parent population. This recombination operator preserves normal statistics, and is a way of sampling from an implicit normal distribution without estimating its parameters. This operator resembles the mutation operator in differential evolution optimization, which also preserves normal statistics for certain parameter values [24]. However, the error in the mean and covariance can be large when diversity is low due to the finite number of variants, and the number of unique recombinants can be too limited. We improve the quality of recombination by generalizing to a stochastic sum over the entire population,

$$\mathbf{x}' = \boldsymbol{\mu} + \sum_i \eta_i \sqrt{p_i} \left( \mathbf{x}_i - \boldsymbol{\mu} \right), \qquad (7)$$

where $\eta_i$ is a random variate drawn from a standard normal distribution $\mathcal{N}(0, 1)$. This operator also conserves normal statistics, tuned by the scale of selection $t$ through the weights $p_i$, and efficiently generates new variants without storing a covariance matrix. In comparison, CMA-ES/NES must store a covariance matrix, which may be challenging in large problems.
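Both operators reduce to a few lines once the weights are known. A minimal sketch of Equations (6) and (7), with a Monte Carlo check that the stochastic-sum recombinants reproduce the weighted population mean:

```python
import numpy as np

def recombine_pair(x, p, rng):
    """Pairwise recombination, Equation (6): x' = μ + (x_j − x_k)/√2."""
    mu = p @ x
    j, k = rng.choice(len(x), size=2, replace=False, p=p)
    return mu + (x[j] - x[k]) / np.sqrt(2.0)

def recombine_sum(x, p, rng):
    """Stochastic-sum recombination, Equation (7):
    x' = μ + Σ_i η_i √p_i (x_i − μ), with η_i ~ N(0, 1)."""
    mu = p @ x
    eta = rng.standard_normal(len(x))
    return mu + (eta * np.sqrt(p)) @ (x - mu)

rng = np.random.default_rng(5)
x = rng.normal(size=(200, 3))
F = -np.sum(x**2, axis=1)
w = np.exp(2.0 * (F - F.max()))
p = w / w.sum()

samples = np.array([recombine_sum(x, p, rng) for _ in range(20000)])
print(np.allclose(samples.mean(axis=0), p @ x, atol=0.05))   # mean is preserved
```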
7. Quantitative Genetic Algorithm
We define a proof-of-principle algorithm that we call the quantitative genetic algorithm (QGA), as it is related to quantitative genetics, and is also a genetic algorithm that is quantitative in the sense that it tracks frequencies of types. QGA iterates between tuning the scale of selection and generating new variants.
A critical problem is how to increase, or sometimes decrease, selection over the course of optimization to make sure the population ends tightly distributed around the optimum. In QGA, variants are given Boltzmann weights (Equation (2)), with $p_i \propto e^{F_i t}$, and the choice of the scale of selection $t$ determines how much the population moves up the natural gradient at the cost of losing diversity. One choice is a fixed schedule of increases in selection $t$, as in simulated annealing and population annealing [20,21]. However, choosing a good rate is challenging, as increasing selection too quickly will lead to premature convergence, and increasing selection too slowly wastes computational resources. In addition, a constant rate may not work well when different parts of the fitness landscape have different curvatures.
It is clear that some intermediate level of selection is needed, although the peak in scaled genetic load or fitness variance is not necessarily the right balance. We implement an adaptive strategy that keeps diversity fixed and lets selection vary on its own. After a round of generation, population entropy will typically increase by a small amount. We then choose a new scale of selection $t$ such that the entropy is at a target value $S$, which is a hyper-parameter of the algorithm. This way the population adapts to keep the diversity constant, with high-fitness variants driving selection higher as they are found. If the diversity $S$ is too small, the algorithm can still converge prematurely outside of a local quadratic optimum, i.e., on a slope.
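Matching the target entropy is a one-dimensional root-finding problem in $t$, since the entropy of the Boltzmann weights decreases monotonically for $t \ge 0$. The Methods adjust $t$ geometrically by small steps; bisection on $\log t$ is one simple alternative, sketched here:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def boltzmann(F, t):
    w = np.exp(t * (F - F.max()))    # stabilized Boltzmann weights, Equation (2)
    return w / w.sum()

def find_t(F, S_target, t_lo=1e-9, t_hi=1e9, iters=100):
    """Bisect on log t until the population entropy matches S_target (in nats)."""
    for _ in range(iters):
        t = np.sqrt(t_lo * t_hi)                    # geometric midpoint
        if entropy(boltzmann(F, t)) > S_target:
            t_lo = t                                # entropy too high: select harder
        else:
            t_hi = t
    return np.sqrt(t_lo * t_hi)
```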
For generating variants, we use the recombination operator of Equation (7), which has the same expected mean and covariance as the parent population. The most common approach is to operate in generations, where the population is completely replaced each round, at the cost of losing information from previous generations. Here, we take a different approach that discards information only when it becomes irrelevant. We generate only one variant per iteration, and integrate it into the existing population. The parent population has Boltzmann weights while the ensemble of possible recombinants is unweighted, so there is ambiguity in how to combine these different statistical ensembles. We found the simplest approach to be the most practical: add each new variant to the population, and recalculate the Boltzmann weights in the next selection step. We can further simplify the implementation by fixing the number of variants, discarding the variant with the smallest weight and replacing it with the newly generated variant.
Furthermore, we modify the recombination operator by replacing the mean $\boldsymbol{\mu}$ in Equation (7) with the variant with the highest fitness observed thus far, $\mathbf{x}^{*}$, equal to the mean at infinite selection $t \to \infty$. Shifting the mean of the recombinant distribution in this way was found to perform much better in benchmarks (see below). The resulting algorithm (Algorithm 1) is very simple compared to CMA-ES, since it requires no matrix inversion or decomposition, and the only hyper-parameters are the target entropy $S$ and how the initial population of variants is generated. We provide an implementation online, with minor implementation details described in Methods.
Algorithm 1 Quantitative genetic algorithm.
1: Choose hyperparameter $S$
2: Draw population $\mathbf{x}_i$ for all $i$ from initial normal distribution
3: repeat
4:    Find $t$ such that $-\sum_i p_i \ln p_i = S$, with $p_i \propto e^{F_i t}$
5:    Sample standard normal variates $\eta_i$ for all $i$
6:    Add $\mathbf{x}' = \mathbf{x}^{*} + \sum_i \eta_i \sqrt{p_i}\, (\mathbf{x}_i - \mathbf{x}^{*})$ to population
7: until convergence criteria met
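A compact, self-contained sketch of Algorithm 1 follows (function and variable names are ours; entropy is in nats here, whereas the Methods use base 2; the helpers repeat the selection sketch above for self-containment):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def boltzmann(F, t):
    w = np.exp(t * (F - F.max()))
    return w / w.sum()

def find_t(F, S_target, t_lo=1e-9, t_hi=1e9, iters=100):
    for _ in range(iters):
        t = np.sqrt(t_lo * t_hi)
        t_lo, t_hi = (t, t_hi) if entropy(boltzmann(F, t)) > S_target else (t_lo, t)
    return np.sqrt(t_lo * t_hi)

def qga(f, x0, S, max_evals=20000, rng=None):
    """Minimal QGA sketch: maximize f, starting from initial variants x0 of shape (N, D)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    F = np.array([f(xi) for xi in x])
    for _ in range(max_evals):
        t = find_t(F, S)                       # selection: match target entropy S
        p = boltzmann(F, t)
        best = x[np.argmax(F)]                 # μ replaced by best-so-far variant
        eta = rng.standard_normal(len(x))
        child = best + (eta * np.sqrt(p)) @ (x - best)   # modified Equation (7)
        worst = np.argmin(p)                   # fixed population size: drop the
        x[worst], F[worst] = child, f(child)   # lowest-weight variant
    i = np.argmax(F)
    return x[i], F[i]
```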
8. QGA Performance
First, we demonstrate QGA on two simple test functions (see Methods). Optimization of a 5-dimensional quadratic function (Figure 3A,B) converges to the optimum for large target entropy $S$, and converges outside of the optimum for smaller entropy. The selection scale $t$ increases exponentially with the number of evaluations when the algorithm is converging. On a non-convex 2-dimensional Rosenbrock function (Figure 3C–E), selection is tuned according to the curvature of the landscape over the course of the optimization. Initially, selection increases rapidly as the population falls into the main valley, then slows down as the population moves along the flatter part of the valley floor, and finally speeds up again as it converges to a quadratic optimum.
To more rigorously assess performance, we tested QGA on noiseless test functions from the black-box optimization benchmark library [25] (implemented in [26]), which provides an ensemble of “instances” of each function with randomized locations of optima. On 5-dimensional quadratic and Rosenbrock functions, the chance of convergence to the optimum and the average number of function evaluations are sensitive to the hyper-parameter target entropy $S$, with low $S$ leading to premature convergence and high $S$ leading to a high number of function evaluations (Figure 4, red lines). Performance is significantly improved with the modified recombination operator, which replaces $\boldsymbol{\mu}$ in Equation (7) with the best variant $\mathbf{x}^{*}$. This modified algorithm is able to make larger steps towards the optimum with lower values of $S$ without prematurely converging (Figure 4, blue lines).
We tested QGA with the modified recombination operator on all 24 noiseless test functions over many values of the target entropy $S$, and compared it to CMA-ES [26] (Figure 5). Overall, the performance of QGA with a good choice of $S$ is very similar to that of CMA-ES, which may be expected due to their fundamental similarities. Both algorithms perform well on functions with strong global structure, where quadratic approximations may work well, and perform worse on functions with highly multimodal structure. QGA has higher chances of convergence for some of the multimodal functions (F15–F22). For the step ellipsoidal function (F7), QGA fails completely due to a check for numerical error (see Methods), although CMA-ES also performs poorly.
9. Discussion
We have shown how a fundamental object known by different names (the Boltzmann distribution in statistical physics, the softargmax function in computer science, and the replicator equation in evolutionary biology) performs natural gradient updates. We used these insights about the natural gradient, along with insights drawn from biological recombination, to create a novel evolutionary optimization algorithm that is remarkably simple and efficient.
In benchmarks, QGA requires more fine tuning of its hyper-parameter than CMA-ES. However, QGA is much simpler, both conceptually and in implementation, than CMA-ES and NES. Those algorithms are somewhat complex and involve matrix inversions or decompositions, as well as adaptive learning rate schemes. QGA does not require any advanced linear algebra and does not store a covariance matrix explicitly, which may make it usable on higher dimensional problems, where storage of a covariance matrix may be an issue. Additionally, QGA naturally incorporates information from the history of the optimization, whereas CMA-ES/NES have “generations” of samples and incorporate past information more heuristically.
In comparison to the general class of genetic algorithms, QGA has deterministic selection, which is much more efficient than stochastic fitness-proportionate selection. In addition, the recombination operator preserves relevant correlations between dimensions, in contrast to typical crossover operators.
We have shown that selection steps in evolutionary optimization algorithms and population genetics are intimately related, and we developed a way to control the increase in selection. Our method of increasing selection by fixing the population entropy is simple and adaptive, yet its implications are not entirely clear, and there may be other strategies that tune selection more efficiently. The challenge is to find observables that indicate premature convergence early enough to be able to continually adapt the diversity of the population.
Our recombination operator is efficient in that it obviates the need to calculate the covariance matrix. Our algorithm (like CMA-ES/NES) models the population as normally distributed, so it may be useful to extend our scheme to other distributions. Since the natural gradient is independent of parameterization, it is possible to generalize our scheme to any distribution. Specifically, the parameters of a generative model can be estimated by maximum likelihood, with variants as data points with Boltzmann weights. This distribution is then used to generate new variants for the next iteration. The benefit of using our form of natural selection is that it does not require an explicit natural gradient update, which would involve second derivatives of the generative model. For natural selection, the limitation on diversity still applies, i.e., when selection is high, there are fewer degrees of freedom to approximate the objective function. Such an algorithm may be useful in generating high fitness variants for difficult problems, including complicated discrete spaces such as in the directed evolution of proteins.
10. Methods
10.1. Algorithm Details
Inputs are the chosen target entropy $S$ (in base 2) and the mean and variance of the initial normal distribution from which $N$ variants are drawn. In the selection step, we calculate $p_i = e^{F_i t} / Z$, where $Z = \sum_j e^{F_j t}$, and $t$ is incremented or decremented geometrically by a small factor until the entropy matches $S$. In the recombination step, a new variant is generated with an unbiased version of Equation (7), in which the weights are rescaled to correct the bias of the weighted covariance estimate. To simplify the implementation, the list of variants is held at a fixed size $N$ by replacing the variant with the lowest probability with the new recombinant. Selection and recombination are iterated until the desired convergence criteria are met.
In addition, the algorithm is stopped when duplicate values of fitness for different parameters are detected in the population. The reason for this check is that often when the algorithm is prematurely converging, selection t diverges, and numerical error results in non-unique values of fitness. This check for numerical error causes the algorithm to fail on the step ellipsoidal function (F7).
10.2. Test Functions
The test ellipsoid is a 5-dimensional quadratic function with its optimum at zero, and populations were initialized with 200 random variants drawn from a normal distribution with mean and variance all equal to one, so that populations are not centered on the optimum. The test Rosenbrock function is the standard 2-dimensional form, $F(x_1, x_2) = -\left[ (1 - x_1)^2 + 100 \left( x_2 - x_1^2 \right)^2 \right]$, and populations were initialized with 200 random variants drawn from a normal distribution with mean $(0, 1)$ and standard deviation $(0.25, 0.25)$.
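As a usage example, the QGA sketch from Section 7 can be run on this Rosenbrock setup (the target entropy value here is an arbitrary illustrative choice):

```python
import numpy as np

def rosenbrock_fitness(x):
    """Negative 2-D Rosenbrock: maximum of 0 at (1, 1)."""
    return -((1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2)

rng = np.random.default_rng(0)
x0 = rng.normal(loc=(0.0, 1.0), scale=0.25, size=(200, 2))  # Methods initialization
best_x, best_F = qga(rosenbrock_fitness, x0, S=np.log(20.0), rng=rng)
print(best_x, best_F)   # best_x should approach the optimum at (1, 1)
```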