1. Introduction
The effect of a mutation on the fitness of an organism usually depends on the genetic background, or context in which it occurs, a phenomenon known as epistasis [
1,
2]. Because of this, the fitness landscape of a gene, a group of genes, or an organism will contain many isolated peaks and valleys [
3,
4], resembling the energy landscape of a physical system such as a glass. Under selection pressure, a population tends to evolve along a path of steepest ascent in fitness until it arrives in the neighborhood of a local fitness peak; In order to escape a sub–optimal fitness peak, the population, or some part of the population, must traverse across a valley of lower fitness. How such transitions occur [
1,
4,
5,
6,
7,
8], and how they relate to adaptation [
9] have remained subjects of active interest in evolutionary genetics for almost a century.
The most basic example of valley crossing is realized in the compensatory interaction of individually deleterious mutations at two genetic loci—for instance, as might result from the physical interaction between amino acids in a protein, or a pair of nucleotides in an RNA molecule [
5]. The archetypal model of this situation consists of a pair of diallelic loci with initial and final states
and
respectively; Mutations to
and
incur a fitness cost
s relative to
when introduced individually, but are neutral when introduced jointly. Kimura was the first to study this problem using the diffusion approach [
5,
10], and he found that deep fitness valleys could be crossed on a relatively short time scale if mutation rates are sufficiently large—specifically, when
where
N is the population size and
is the mutation rate per gene per generation. In this case, fitness valleys are navigated by a process known as stochastic tunneling [
11], in which a small fraction of genes accumulate in an intermediate state, are compensated by a second mutation, and ultimately proceed to fixation—the intermediate acting as a kind of stepping stone [
12]. The situation studied by Kimura closely resembles the process of Watson–Crick switching between favorably paired nucleotides in RNA stem sites, and in particular, switching in mammalian mitochondrial (mt) tRNA molecules where stochastic tunneling is significant. Meer et al. investigated this problem somewhat recently [
13], and, using Kimura’s model, they found that mammalian mt tRNA switches may navigate valleys of depth even as large as
(here, we assume that, for equal numbers of males and females, the effective population size for mitochondrial genes is one fourth the effective population size for nuclear genes [
14,
15]). In support of this result, Meer et al. obtain essentially the same estimate for
from the frequency (
p) of disrupted Watson–Crick pairs using the relation
for mutation–selection balance [
16]. To put this number into context, it is at least ten times larger than would be expected if the same model had evolved by sequential fixation of deleterious and compensatory mutations (i.e., as would be expected when
[
17]).
While these estimates may be accurate, it is difficult to reconcile the evolutionary dynamics of folded biomolecules with two–locus models. Naturally evolving genes encoding proteins and RNA molecules are always faced with a complex spectrum of possible routes on their fitness landscapes, and it is these spectra that ultimately determine the rate for crossing valleys of a given depth. Even for tRNA molecules, compensation of disrupted Watson–Crick pairs seems to occur more often through complex, indirect mechanisms than through direct compensation to restore Watson–Crick pairing [
18]. Proteins are more connected objects than RNA molecules (i.e., with more opportunities for epistatic interactions between loci), and the greater complexity of protein sequences is almost certain to present a more complex spectrum of possible routes to a protein gene in which valleys (ravines, etc.) are entered and exited in multiple steps (
Figure A1).
Are the large effects predicted by Meer et al. common in biomolecular evolution? To our knowledge, this kind of question has never been asked directly, by considering the problem of valley crossing for a protein or RNA molecule as a whole, including the complex effects of fitness interactions between multiple loci. Here, we simulate the evolution of a small protein motif using an exact fitness model that is simple enough to allow for adequate sampling of valley crossing statistics. We evolve populations of model genes by haploid Wright–Fisher dynamics across a range of mutation rates spanning the sequential fixation () and stochastic tunneling () regimes, and we record the mutational paths of all alleles in our populations. Using this data, we compute the probability that an allele, randomly sampled from a population at time t, has crossed a fitness valley of depth s during a time interval in the immediate past. Surprisingly, we find that, even on the time scale of human evolution, genes encoding small protein motifs evolving under conditions consistent with mammalian mitochondria already navigate fitness valleys of depth with probability , in rough agreement with the estimate for Watson–Crick switching in mt tRNAs provided by Meer et al.
2. Methods
Epistatic effects play an essential part in protein evolution [
19,
20,
21,
22,
23,
24], and because these effects depend on the relative probabilities of conformations in protein ensembles [
25,
26,
27], it is important to select a model in which the salient properties of protein ensembles are retained as much as possible. Ultimately, we found that we could obtain sufficient data for valley crossing statistics in a reasonable period of time using small lattice proteins. Lattice models have been used extensively in studies of protein folding and evolution, and the model we employ here is similar to one recently used to explore the effects of epistasis on the predictability of protein evolutionary pathways [
25].
Below, we evolve lattice proteins under equilibrium conditions to maintain marginal stability in a specific folded (native) conformation. The stability of a protein is measured, as usual, by the free energy difference between the native conformation and the rest of the conformational ensemble,
where
is the energy of conformation
, the subscript
N denotes the native conformation, and factors of temperature are absorbed into the definition of energy; The energy of a conformation is determined from its amino acid contacts by empirical amino acid contact potentials [
28] (as a result, energies are defined in units of
kcal/mol).
We assume that mis–folded proteins are non–functional, and otherwise toxic to an organism [
29,
30,
31]. In this case, protein fitness can be defined by the probability of finding an individual protein folded in its native conformation [
25,
32],
However, since most naturally occurring proteins are only marginally (as opposed to maximally) stable [
24], we decided to model fitness using a logistic function
where
(
Supplementary Figure S1). Under this condition, evolved genes in our simulations typically encode proteins with
or, equivalently,
[
33].
We evolve populations of protein genes by plain Wright–Fisher dynamics, with discrete generations, fixed population size, and no recombination [
34]. In each generation, a Poisson random number of nucleotide sites with mean
are selected at random; The sites are subjected to random mutation, the fitness values of mutant alleles are computed, and
N offspring are selected from the population to form the next generation. The probability that an allele
i survives to the next generation is
, where
is the frequency of allele
i in the current generation [
35].
In the absence of recombination, each allele in a population has a unique mutational history extending back to the origin of a simulation. To describe the statistics of valley crossing, we record the histories of all alleles, and we compute
, the probability that an allele, randomly sampled from the population at time
t, has crossed a fitness valley of depth
s during the time interval
, where time is measured in generations. Depending on its length, an interval
along the fitness history of an allele may contain several (perhaps nested) valleys of varying depth (
Figure A1). However, rather than attempt to record each valley as an individual event, we simply define
s as the maximum valley depth traversed along an interval (see
Appendix A). A peculiar feature of this approach is that, for small values of
s, we can choose an interval length
long enough that
is a decreasing function of
(i.e., since larger values of
s are more likely to occur on longer time scales). However, below we will be concerned mainly with large values of
s and time scales
that are far from this sort of turnover region.
3. Results
To obtain data for
in a reasonable period of time, we limit our study to chains with at most 16 amino acids (802,075 conformations unrelated by symmetry [
36]). The fitness landscapes of longer chains will clearly differ, however, we expect that within reason, results for somewhat longer chains (e.g., 32 amino acids) will be similar, given proper adjustments to the mutation rate per gene
(see below).
We simulate protein evolution for different chain lengths (
), native folds (
Supplementary Materials Figures S2 and S5), population sizes (
), and mutation rates (
). In each situation, we conduct replicate simulations in parallel on multiple processors of a high performance computer [
37]. Each processor begins with a monomorphic population constructed from
N copies of a gene encoding a randomly selected amino acid sequence. Each population is then equilibrated until an allele reaches fixation with
. After this point, alleles are sampled at random from a population every
generations and their histories are recorded. For the most computationally intensive problems (i.e., for the largest chain lengths and mutation rates), we are able to generate
samples in about ten days using 128 processors. For chains with 12 amino acids (15,037 conformations), samples can be obtained much more rapidly, and results for chains of 12 and 16 amino acids are actually very similar (computer code and sample data are available from the authors on request).
Since effective population size varies substantially across mammalian species, it is important to ask whether simulations conducted for a particular population size can be used to estimate
for larger (or smaller) populations evolving at the same overall rate
, as expected from diffusion theory [
38]. To answer this question, we compared the scaled distributions
for population sizes
, 200, 500 and 1000. For a fixed mutation rate
, we find that plots of
roughly collapse to the same curve. As a result, we can estimate
for realistic populations using a much smaller population size, which greatly reduces the amount of time spent on the simulations. The reason for this is fairly simple–the local structure of the fitness landscape, as measured by e.g., the distribution of fitness effects or the probability of compensatory neutral mutations, scales in a similar way with population size; As population size increases, the landscape in the neighborhood of an evolved sequence becomes less rugged in proportion to
N.
Results of this exploration are described in
Figure 1 and
Figure 2. In
Figure 1, we plot the distributions of beneficial and deleterious fitness effects,
and
, respectively. Each pair of plots corresponds to a simulation for one of the population sizes listed above (the width of a plot increases with decreasing
N). The results describe proteins with 12 amino acids folding to the native conformation in
Supplementary Materials Figure S2, and the overall mutation rate in each simulation is
.
To generate data for
Figure 1, we sampled the landscape around evolved genes using a simple procedure that mimics error–prone polymerase chain reactions [
41]; The procedure begins from a large number of copies of an evolved gene. A Poisson random number of random mutations are applied to each copy, and the results are sorted by the number of mis–sense mutations (i.e., neglecting back mutations). For each simulation, we randomly selected
evolved sequences. Each evolved sequence was then used to generate
random single amino acid mutants, for a total of
mutants per plot.
As is evident by closer inspection of
Figure 1, the probability of a beneficial (or deleterious) mutation with effect
decreases almost linearly with increasing population size; The scaled distributions
for different population sizes roughly collapse onto a single curve. A similar result is obtained for the distribution of compensatory neutral double mutants
(
Supplementary Materials Figure S3). The collapse is shown explicitly for the valley crossing probability
in
Figure 2.
Given this result, we now restrict our attention to populations with and proteins with 16 amino acids. To compare our results to those of Meer et al., we require the site mutation rate in our model to agree with the pedigree rate for the control region in human mitochondria used in their estimate for mammalian mt tRNA molecules—about per site per year. Assuming a typical length of about 80 nucleotides for tRNA molecules, a generation time of 20 years, and a (mitochondrial) effective population size of , we arrive at an overall mutation rate of for human mt tRNA genes. To obtain the same site mutation rate for protein genes with 48 nucleotides (16 amino acids), we need an overall mutation rate of about .
We plot
versus
for this situation in
Figure 3, where
. The range of the plot,
, roughly corresponds to the time scale of human evolution (about six million years). Over this time scale,
is roughly linear in
for
. The time scale for mammalian evolution is much longer (on the order of tens of millions of years), however, on human time scales, the frequencies of large events in our model are already approaching the ten percent levels observed for Watson–Crick switching events in mammalian mt tRNA phylogenies [
13]. For example, the probability of sampling an allele that has crossed a fitness valley of depth
for
is about
, the probability for
is about
, and the probability for
is already about
.
Clearly, these numbers will continue to increase for larger values of
and larger mutation rates
(
Figure 4). In addition,
will also increase with chain length since, for a constant site mutation rate, the mutation rate per gene
is proportional to chain length. If we assume that the fitness landscapes of proteins with 16 amino acids can be used to represent the landscapes of larger chains, then e.g., doubling the mutation rate per gene will have the same effect as doubling the chain length to obtain a protein encoded by the same amount of genetic material as a tRNA molecule or small protein motif [
42]. This is not an unreasonable assumption, since, as we have noted earlier (see the captions to
Figure 1 and
Figure S3), the local structures of fitness landscapes in the model, as measured by the scaled distribution functions
and
, are already similar to those inferred from real proteins with much longer sequences. In this case, extrapolating from the data in
Figure 4, we find that the probability of sampling an allele that has crossed a valley of depth
over a time interval
increases to about
. Thus, even neglecting the anticipated increase in
for
, the results for small protein motifs are already consistent with those of Meer et al.
As a final note, it is important to remark that the deepest valleys navigated by alleles in our simulations actually correspond to events in which a deleterious mutation is, to a major extent, compensated by a mutation back to a similar amino acid type at the same site (see
Supplementary Materials Figure S6). It is also worth noting that local fitness peaks sampled at various points in our simulations (by steepest ascent in fitness, starting from a randomly sampled sequence) are often separated by deep fitness valleys, or ravines.
4. Discussion
Does the model provide an accurate picture of fitness valley crossing for small protein motifs? This question is very difficult to answer due to the extreme complexity of protein fitness landscapes, and the unknown effects of varying host genetic backgrounds experienced by protein genes. However, from basic principles, it appears that the model is roughly accurate: Local features of protein fitness landscapes, such as the distribution of fitness effects, can be inferred from protein sequences by fitting a population dynamics model to branches of a phylogenetic tree [
39] such that background effects are accounted for by allowing the parameters of the model to vary among branches. Tamuri et al. have used this type of approach to estimate the distribution of fitness effects for mammalian mitochondrial proteins [
39], and our results for
are in good agreement with their data (
Figure 1). Given the similar nature of fitness conditions in each system (i.e., that proteins are also polymers required to fold into a specific shape in order to function), it is seems reasonable to expect that the topographies of model fitness landscapes resemble those of small protein motifs.
Longer chains with more developed core structures, and more restrictive fitness conditions such as binding to proteins within a larger domain, may lead to qualitatively different results due to the potential for larger compensatory effects [
27], and the suppression of substitution rates in the core and binding interface regions. In addition, much larger mutation rates can occur in micro–organisms such as viruses [
43]. and our results suggest that, under these conditions, fitness valley crossing will be more pronounced. RNA viruses such as HIV–1 are also subject to high rates of recombination [
44,
45], which may interfere with valley crossing [
5,
6]. Because the computational cost of our simulations increases in proportion to the number of mutations (i.e., the number of fitness calculations), a full study of the model for virus proteins may be challenging. However, we hope to address these problems in future work.