1. Introduction
The origins of life, abiogenesis, is a matter of high importance, for it gives insight into the distribution of life in the universe. We focus on the RNA world hypothesis, where life began with self-replicating RNA molecules that can evolve under Darwinian evolution, following necessary conditions of compartmentalization and metabolism, for geometry and synthesis of nucleobases from metabolic precursors, respectively. Self-replicating sets of RNA were proposed first by Tibor Ganti [
1,
2] and have been studied by many others [
3,
4,
5,
6]. This is an information-centric perspective on abiogenesis, representing the putative beginning of genomic Darwinian evolution. Information centrism interprets a living organism as an operating genetic communication system in some connected domain that encodes and decodes genomic state relative to a replication channel.
While genomics, epigenomics, and transcriptomics of modern-day organisms are based on DNA, RNA, and epigenetic marks such as DNA methylation, RNA origins in their purest form concern the dual-function of RNA as an informational polymer and ribozyme. This article is similar in spirit to works in the 1970s through the 1990s, including Manfred Eigen’s works on replicating sets of RNA [
7,
8]. Clues to the RNA world, among others, are found in the nucleotide moieties in acetyl coenzyme A and vitamin B12, the structure of the ribosome as a ribozyme [
9] and moreover the centrality of RNA to the translation system, and the existence of viroids [
10]. A putative canonical RNA origins sequence involves RNA dependent RNA polymerase (RdRp) ribozyme, whose first gene is perhaps the Hammerhead (HH) ribozyme, enabling rolling-circle amplification of the sequence [
11]. This setup is unnecessary, as the first role of RNA may not have been template-assisted polymerization but instead based on mechanisms of RNA recombination and networking [
12]. If we assume the RdRp sequence is size 200 nucleotides, then there are
sequences. Starting from some population of interacting RNA molecules, we are interested in the times of first occurrence of critical events. Evolution seems to have concluded this search for a high-fidelity replicator in a fairly short period of time, i.e., within 400 million years of the Earth having a stable hydrosphere [
5]. RdRp’s are known to be very ancient enzymes, are necessary to all viruses with RNA genomes, and have been proposed to have originated from junctions of proto-tRNAs relative to the context of a primitive translation system [
13]. Recent work has shown that replicative RNA and DNA polymerases have a common ancestor of a RdRp [
14]. Directed evolution, selecting on polymerization, is a potential way of identifying such a ribozyme. Directed compartmentalized self-replicating systems (RNA or DNA polymerases) mimicking prebiotic evolution have been demonstrated whereby polymerases are selected on their ability to replicate their own encoding gene [
15]. Directed evolution has identified a RdRp that can replicate its evolutionary ancestor, an RNA ligase ribozyme; however, at increased activity, it has reduced fidelity and cannot maintain the integrity of its information [
16].
A subtlety to the RNA origins argument is that template-directed polymerization by its nature requires two copies of the RdRp sequence, one for the polymerase and another for the template. This makes the RNA origins search extend until two copies are discovered. Cross reactions with other species also may influence the search time for the high-fidelity replicators. The clay mineral montmorillonite, which is common on Earth, can catalyze RNA oligomerization [
17]; however, the utility of montmorillonite in these activities is not thought to be sufficient for origins, having been extensively studied [
18,
19]. Interestingly, it has been proposed that clay not only promotes origins, but constitutes it, which then later gave rise to RNA-based replication [
20]; such mineral life as a genetic communication system has a high mutation rate and is degenerate.
Theories of abiogenesis study metabolism [
21], cellular compartmentalization [
22,
23,
24], hydrothermal vent chemical gradient energy [
25], or hot springs [
26,
27,
28], and so on, to define geochemical settings suitable for origins. These settings are compatible with RNA origins. Compartmentalization leading to selection on random sequences has been explored by studying environment forcing in hot springs and their effects on sequence identification [
26,
27,
28], where the hypothesis is that fluctuations in environment forcing through cycling of wet, dry, and moist phases of lipid-encapsulated sequences subject sequences to combinatorial selection and identify structural and catalytic functions from the initial system state of random sequences. These functions include metabolic activity, pore formation, and structural stabilization. We assume the prebiotic molecular inventories of RNA and its precursors are provided by meteorites [
29] and/or by Miller–Urey processes [
30], such as from formamide [
31] or many phase synthesis [
32,
33]. For energy and environmental factors, we consider a variant of Darwin’s “warm little pond”, where the putative environment for RNA origins of life is an icy pond with geothermal activity, a hot spring, or perhaps a hydrothermal vent: ice and cold temperature facilitate complexing of single strands into double strands and polymerization [
34], and heat (energy) facilitates dissociation of double strands into single strands. More information on abiotic sources of organic compounds, mechanisms of synthesis and function of macromolecules, energy sources, and environmental factors can be found in the literature [
35].
RNA origins have attracted many modeling efforts and analyses [
36,
37]. The concept of a self-replicating set of RNA molecules was initially studied by Manfred Eigen [
7], wherein he studied the error-threshold of the critical fidelity to the main information. Two-dimensional spatial modeling has been applied in which reactions occur locally with finite diffusion, suggesting a spatially localized stochastic transition [
38], simulated using Gillespie’s stochastic simulation algorithm (SSA) [
39]. Another model is of autocatalytic sets of collectively reproducing molecules, which has been developed in reflexively autocatalytic food-generated (RAF) theory [
40]. Various physics-based analyses have been conducted, such as in light of Bayesian probability, thermodynamics, and critical phenomena [
41]. Systems of quasi-species based on the principle of natural self-organization called hypercycles employ non-equilibrium auto-catalytic reactions [
8]. Nonlinear kinetic models for polymerization have been used to study the emergence of self-sustaining sets of RNA molecules from monomeric nucleotides [
42]. Theoretical analysis has been conducted into RNA origins. Attention has been drawn to an evolving population of dynamical systems and how dynamics affect the error threshold of early replicators and possibly towards compartmentalization conveying hypercycles [
43]. String-replicator dynamics have been studied and properties suggested to be necessary to RNA origins, including the ability to operate a functional genetic communication system and ecological and evolutionary stability [
44,
45,
46]. A variety of pre-RNA worlds have been suggested, with RNA being preceded or augmented by alternative informational polymers, such as other nucleic acids [
47], beta amyloid [
48], polycyclic aromatic hydrocarbons [
49], lipids [
24], peptides [
50], and so on. It seems that pre-RNA worlds existed independent of the RNA world in the sense that they are not ancestral to the RNA world, and that these worlds may have had non-trivial interactions with the RNA world.
A key concept of stochastic systems is that of a hitting time: the time of the first occurrence of some event. We develop a simple mathematical model at the sequence level to represent the synthesis and function of RNA molecules in order to gain insight into the hitting times of various critical events of RNA origins of life. The idea is to study the surface of hitting times in terms of the structure of the system. The model lacks many features of realism, such as sequence size variability, finite sources of “food” (activated nucleotides in our context), limited diffusion rates, poor system mixing, and so on, in order to concentrate on the process as a search problem. The notion of fitness landscapes has been studied extensively in evolutionary biology [
51]. Landscape topology has been considered in an Opti-Evo theory, which assumes sufficient environmental resources and argues that fitness landscapes do not contain “traps” and globally optimal sequences form a connected level-set [
52].
We describe the model in
Section 2, where we define a replicating reaction network, whose random realizations are constructed using SSA. We describe hitting times as key random variables of interest and characterize polymerization as a transition kernel. In
Section 3, we conduct and discuss simulation studies based on SSA, where we analyze the structure of the polymerase measures and the input–output and survival behaviors of the hitting times given the parameters of the system. In
Section 4, we end with conclusions.
2. Materials and Methods
We define
, abbreviated as
, corresponding to the RNA bases. We study the space of sequences having length
n, that is,
with space of possibilities (a
-algebra)
, so that the pair
is the measurable space of all RNA sequences of length
n with
. We let
be a probability measure (distribution) on
, giving the probability space triple
.
Appendix A gives an overview of
. A related space, though not utilized in this article, is the space of all RNA sequences up to length
n,
with
and size
. We denote the collection of non-negative
-measurable functions by
.
We build a simplified mathematical model for the time-evolution of a population of interacting RNA molecules in solution. Let be the population at time with initial population . is a multiset, that is, it is a set containing elements possibly with repeats. We assume that the system is well mixed and has access to an infinite source of activated nucleotides.
The complement of
is denoted
, attained using the base-pairing
A with
U and
C with
G. Let
be the Hamming distance between
as the number of positions in
x and
y where the nucleotides differ. We have
and
.
2.1. Core Model
We describe the reaction network of the system below. We simulate trajectories of the system using the stochastic simulation algorithm (SSA) [
39], to simulate exact trajectories for the evolution of stochastic reaction networks here. SSA forms a Markovian process, where the arrival of reactions follows a Poisson (point) process, and assumes that the reaction volume is well mixed and homogeneous, with all parts of the system accessible for reactions. Reactions across various disjoint volume elements of the system are dependent. We do not consider the effects of finite diffusion, which effects a length scale above which disjoint volume elements are effectively independent [
38].
2.1.1. System
We model the population of sequences which can form double-stranded helices, dissociate, and replicate with mutation with replicator fitness and sequence specificity. We interpret each system element as a set, either containing one element—a single-stranded sequence—indicated as
—or two elements, single and complementary stranded sequences, indicated as
, where
(double bracket notation indicates a collection, or set, of sets, i.e.,
). We define system elements as sets
with respective
-algebras,
,
and
. The sizes are
and
, and
. The set
contains single-stranded sequences, whereas the set
contains double-stranded sequences; the set
is the union of
and
, containing both single and double stranded sequences. These sets are necessary to track the various sequences (single and double stranded). The reaction network of the system is given by
and expressed in terms of sets
Reaction (
2) is double-strand formation from complementary single-strands with reaction rate
. Reaction (
3) is the dissociation of double-strands into single-strands, caused by a heat source, with reaction rate
. Reaction (
4) is template-directed polymerization of a single-strand (the template) by another single-strand (the polymerase), producing a single-strand complementary to the template with some fidelity, with reaction rate
(which functionally depends on the polymerase and template sequences).
The polymerization reaction rate is defined by
where
is a positive constant,
is the replicative
fitness of
x (as a polymerase) and
is the
similarity between
x and
y, a symmetric function. Finally,
x replicates
y, outputting a version
with mutations, where each nucleotide position has fidelity probability
. Note that the similarity function can be either trivial/constant, i.e.,
, or non-trivial. For example, if we assume sequence similarity to correlate to spatial proximity of sequences, as assumed below, then
is non-trivial.
2.1.2. High-Fidelity Set
To define a high-fidelity set, pick an arbitrary subset of sequences as high-fidelity replicators of size . We define two ways.
We define
using a product of non-empty random nucleotide subsets
for each nucleotide position
so that
where
, etc. and
. For simplicity, we assume
with fraction 4 being
. Thus,
is a subset of
E defined as a product space.
For another construction of R, we define a finite union of m random sequences .
2.1.3. Distance
We define the Hamming distance
between sequence
x and high-fidelity sequence set
as
2.1.4. Fitness
We define “tent-pole” fitness
of sequence
and high-fidelity sequence set
for curvature parameter
as
The maximums are the sequences of the high-fidelity sequence set
, which are the “points” or “poles” of the surface, with exponential decay into the remainder of the space in string distance. The strength of the decay is governed by parameter
k, called the curvature parameter, which can be specified through the value of fitness at
(sequence dimension),
2.1.5. Similarity
We define two cases for the similarity function appearing in the reaction rate of template-directed polymerization. The first case is the trivial (constant) case where the
This assumes that there is no mechanism by which sequence specificity is selected for, such as in the case that polymerases should evolve to generically well replicate sequences, including their own. This means that they will spend much of their time replicating other sequences.
For the second case of a non-trivial similarity function, we note that RNA origins of life are thought to be a spatially localized stochastic transition, where high-fidelity replicators are found concentrated in foci, following from the increased replicative mass of the replicators. Hence, we implicitly encode spatial information through a non-trivial similarity function, based on a distance function, which increases the replicative system mass for similar (and here nearby) high-fidelity replicators, that is, if they’re similar, then they’re likely proximal. In the following definition, we use the notation ∧ for the minimum of two numbers,
. Distance
S is defined between two sequences
as
Similarity
of sequences
for curvature parameter
is defined in terms of exponential decay as
Presently, replicators can replicate other sequences well but not their own [
34]. There may exist RdRps that are excellent polymerases and, in conjunction with RNA hammerhead ribozyme, engage in rolling circle amplification of the polymerase-hammerhead sequence (genome) so that the amplification process is self-cleaving. This results in a large increase in replicative mass due to the super-exponential growth in the population of the high-fidelity replicators within a small volume. In the context of SSA, the similarity function here is an ansatz for spatial locality.
2.1.6. Fidelity
Finally, polymerization fidelity probability for curvature
is defined as
Note that for high-fidelity sequences .
Note that fitness, similarity, and fidelity are defined for single-stranded sequences .
2.1.7. Counting Representation
The process
is the time-evolution of the system. Recall that
is a multiset.
contains the individual single stranded molecules in the set
, i.e., sequences
and double stranded molecules in the set
, with overall set
having size
. Note that, in the set representation, there is symmetry
, so that the size of the double-stranded set is equal to
. The system evolution
induces a random counting measure
on the overall space of single and double stranded sequences
as
The total count, that is, the total number of molecules, is . We assume that the counter is maintained for all times .
2.1.8. Reaction Rates
The total reaction rate is given by the sum of the individual reaction rates
where the reaction rate of double-strand formation from complementary single-strands is given by
the reaction rate of dissociation of double-strands into single-strands is given by
and the reaction rate for template-directed polymerization of a single-strand by another single-strand (the polymerase) is given by
The first reaction rate is a sum over the single-strands of
with size
. The second is a sum over double-strands
with size
. The third reaction is a sum over the product space of single stranded sequences,
, with
number of elements. Therefore,
requires
elements to be evaluated for every reaction. Clearly, direct representation on the full space is very expensive and impractical for even modest
n. One obvious way to improve efficiency is not summing over the zero elements. We define sets
and
as the unique single and double stranded sequences of the system. Then, direct calculations of the reaction rates are
and
and
For this approach, the replication rate has quadratic dependence on . Using the reaction rates, the system may be exactly simulated using SSA. The reaction at time t with rate occurs over time interval . As the reaction rate increases with increasing number of molecules , the reaction rate increases and reaction duration decreases over time. The natural consequence of increasing process intensity is that the system speeds up.
The quadratic dependence may still be too expensive for large simulations.
Appendix E describes Monte Carlo approximation of the reaction rates.
2.2. Hitting Times
We define some hitting times. The initial population consists of
I single-stranded sequences
, i.e.,
. We define the hitting time
for the time of the first replication event
We define the hitting time
for the appearance of sequences in the high-fidelity sequence set
,
Put
and define the volume fraction of high-fidelity sequences
at time
t as
We define the hitting time
where high-fidelity sequences of
emerge and reach a minimum volume fraction,
This hitting time reflects that period wherein a high-fidelity replicator has been identified yet there exists no complementary high-fidelity sequence for amplification, hence the system diversity continues to increase, decreasing the concentrations of all extant sequences as more sequences are discovered. The minimum hitting time captures the duration of time the high-fidelity replicator exists by itself. We define the hitting time
for the time high-fidelity sequences in
constitutes some volume fraction
of the population,
In practice for simulations, is censored based on some total number of reactions, that is, if the volume fraction is not achieved by n reactions, because there is no arrival time.
For SSA, we specify a maximum number of reactions
N to simulate. We have parameters
for
, such as landscape curvature
k, sequence dimension
n, etc. Therefore,
is right-censored with value
∞ at simulation time
a, as some simulations will stop at time
a with no arrival time. These are censoring events. For fixed
, the
is a random variable, due to the stochastic nature of SSA. Hence, for each parameter vector
, we attain a set of
M realizations of hitting time
as
For convenience, we assume that the realizations are ordered by non-censored followed by censored.
2.2.1. Functional Structure
For each parameter vector
, we record two values: the number of hitting events in the hitting time set
and the average of the hitting time
To describe the functional structure of the average hitting time
, we require a classifier which determines whether or not there are zero hittings
and a regressor for the value of
for hittings
. We assume that the parameters
are randomly sampled according to distribution
and the hitting times recorded. High dimensional model representation (HDMR) may be attained for the classifier (as a probabilistic discriminative model) and the regressor of
. For the regressor, we have HDMR expansion
The HDMR component functions
convey a global sensitivity analysis, where, defining variance term
we have a decomposition of variance
The normalized terms
are called sensitivity indices.
Appendix F gives a brief description of global sensitivity analysis via HDMR.
2.2.2. Statistical Structure
A second analysis can be conducted on the hitting times
for the parameter vector
using reliability theory. Put random hitting set
where
is an independency of parameter values. For each parameter vector
, we partition the hitting times
into
C censored values with censor times
and
non-censored (hitting) values
. The likelihood is given by
where
f is the hitting time probability density function (‘failure density’) and
R is censoring time distribution (‘reliability distribution’), and
are the parameters of the density and distribution functions. Note that
f and
R each specify each other, so
are the common parameters. Reliability definitions are given in
Appendix G.
We show reliability quantities in
Table 1 for the two-parameter
Weibull
distribution and Cox proportional hazard’s model where
is a vector of coefficients for the
. We use the Python software
lifelines for estimation of
for the Weibull–Cox model from data [
53]. The mean failure time
is given by
We have second moment
giving variance
Thus, if , then and exponentially increase in .
2.3. Surface Chemistries
The system given by (
2) of polymerization with mutation requires two separate hitting events, one sequence in the high-fidelity set
and either another sequence in the high-fidelity set
or its complement
, in order for high-fidelity replicators to maximally engage in templated-directed polymerization and achieve some fraction of the population. This setup of RNA polymerase action, requiring two such events for the polymerase and template, makes the hitting times long. Basically, the same information must be discovered twice before it can be used, which is unsatisfactory. We idealize polymerase activity conveyed by a non-RNA species, here clay, with the parameter
, as clay itself is not thought to be capable of polymerization but is capable of oligomerization of RNA. The reactions are given by
where non-RNA polymerization has mutation with fidelity probability
and
is the complement of
x with mutation. The reaction rates are given by
and
Therefore, upon the first hitting of the high-fidelity replicators
with sequence
through (
4) or (
23),
x gives two high-similarity single-stranded sequences
x and
through (
24), which then may participate in template-directed RNA polymerization (
4).
2.4. Reactions as Measure-Kernel-Functions
All the reactions
which involve substrate
x may be represented using transition kernels, which form linear operators. At each iteration of SSA, a reaction type is chosen, followed by a transition to a particular domain
with distribution
, followed by mapping into a codomain
using transition probability kernel
Q with distribution
. The notions of
and
Q involve measure-kernel-functions. The probability of transition of
y into
given
x is given by
QWe define kernels Q for RNA and non-RNA polymerization to provide insight into the reactions. Consider for some . Recall that is the random counting measure of on single and double-stranded sequences . For RNA and non-RNA polymerization, we take as a (random) probability measure for x in domain and describe a transition probability kernel Q from x into y in codomain .
2.4.1. RNA Polymerization
For RNA polymerization, we have that
which results in the creation of the single-stranded sequence
. The first dimension is the polymerase and the second is the template. We give a definition of the probability measure on the product space of sequences. Recall that
denotes the collection of non-negative
-measurable functions.
Definition 1 (Measure on domain
).
Let be a random probability measure on formed by random counting measure (12)We write for .
Because the first two coordinates are preserved under the mapping, we focus on the new dimension as a transition from into using transition probability kernel Q. In this case, Q is defined by a matrix whose rows vectors (dimension ) are probability vectors. The structure of Q follows from the polymerase replication with mutation, whereby each nucleotide position has fidelity probability , which depends on the first dimension of . We put for sequence . Now, we state a simple fact on the binomial structure of the number of mutations made by a polymerase.
Theorem 1 (Mutation distribution).
The number of mutations by polymerase on template is distributedwith mean and variance and Now, we partition
E into level sets
by Hamming distance to the template complement
,
We define the transition kernel Q for RNA polymerization, where Q completely encodes RNA polymerization using Theorem 1.
Corollary 1 (Transition probability kernel
Q).
We have that the transition probability kernel Q for RNA polymerization is defined by RNA polymerization is defined by
Q using the binomial structure of polymerase mutation. A more sophisticated model could be defined as a sum of Bernoulli random variables with varying success probabilities in the Poisson binomial distribution. This could be used to take into account polymerase mutation that varies with nucleotide position. Another idea is taking into account schemata such as repeats which destabilize the polymerase [
54].
Proposition 1 (Measure on codomain
).
is a probability measure on defined byIt is multiplication of as a dimension row vector with dimension matrix Q, giving a dimension row vector . We write for .
Define the partition
of
E as
Then,
for
is the distribution on sequences by distance to
R, i.e.,
contains the instantaneous information of RNA polymerization.
A more general model for replication is where polymerase activity is tied to geometry, i.e., compartmentalization/spatial confinement, with state space
and to metabolic state
. In this telling, the polymerase reaction rate could be tied to the degree of spatial confinement and the source of the activated nucleotides from metabolic precursors. Then, the polymerase state-space is
with law
and the polymerase transition kernel
is defined as the mapping from
into
. Thus, the law on the input–output space-state
is given by
, or in differential notation,
2.4.2. Non-RNA Polymerization
If there exists some kind of non-RNA polymerase activity, we have that the mapping
which we regard as a mapping from
into
. Let
be a probability measure on
defined by
Similar to RNA polymerization, for fidelity probability
, we have
Q as the
matrix defined by
and
and
is a probability measure on
defined by
Note that, for SSA, Q is fixed over the simulation, whereas the probability measure depends on time. That is, the reactions are chosen according to the reaction rates, and the reactions each use respective Q. The is formed using a random counting measure, so is random. This approach generalizes in the obvious way to all the reactions.
2.5. Decay
The RNA sequences have finite lifetimes in reality. This comes from a variety of sources, including radiation, pH, intrinsic molecular stability, etc. We assume double-stranded RNA is stable, whereas single-stranded RNA is not. Therefore, we create a reaction for decay of single-stranded RNA into constitutive nucleotides
with reaction rate
2.6. Compartmentalization
It is thought that compartmentalization plays a role in RNA origins of life, giving foci of reproducing sequences [
23,
55]. This is somewhat anticipated by the similarity function
, where sequences are more likely to copy similar sequences than less similar ones, due to an underlying spatial localization. Explicit spatial effects may be modeled by assuming each
is marked with a position on a bounded subset of the real line
. We think of this as a one-dimensional projection of the three-dimensional system. Additional species can be introduced, such as lipids, with reactions forming a vesicle
M (vesiculation), which encloses some
. We assume the lipids interact with the single stranded sequences in
A to form vesicles as
with reaction rate
Hence, vesiculation is coupled to the population of sequences by design so that it evolves on roughly the same time-scale as sequence activities. Note that vesicles can enclose one another, i.e., and where or , but cannot cross, i.e., for all vesicles at locations we have that . For example, suppose one vesicle encloses another two, and . Then, . Although is disconnected in one dimension, the intervals are physically connected in three dimensions, where vesicles are spheres. We identify each vesicle to a union of disjoint intervals, disjoint across the vesicles.
We posit that compartmentalization precedes the hitting of a high-fidelity replicator through ensuring necessary concentration of RNA and a stable environment. Generally, we identify compartmentalization state to with probability measure . Let be a transition probability kernel from into , encoding the transition from compartmentalization coordinates to RNA sequences. The product space has law . Upon hitting a high-fidelity replicator and achieving Darwinian evolution to acquire information, e.g., genes, the sequences are assumed to become adapted to compartmentalization coordinates through the transition probability kernel from into , so that is the law on the full space . In this telling, compartmentalization precedes RNA activity, and, upon hitting high-fidelity replicators that can maintain their information, is followed by genomic adaptation.
2.7. Metabolism
We identify metabolism reaction-state to the measurable space
with probability measure
. Let
be a transition kernel from
into
, positing that metabolism precedes replication. For example, certain metabolic state may be precursors to the synthesis of RNA. Consider product space
with measure
. Now we suppose that, upon achieving Darwinian evolution in replicators, the replicators will eventually become adapted to
. Hence, we interpret
as a mark-space of
, representing genomic adaptation. Let
be a transition kernel from
into
. Then,
is a probability measure on
, where
or
Therefore, metabolism-first followed by replication and genomic adaptation is encoded by the structures of and . We do not specify these transition kernels in this article but mention that they are richly textured.
2.8. Reaction Overview
The reactions of the system having decay and clay and their reaction orders are shown in
Table 2. There is one zero-order reaction, three first-order reactions, and two second-order reactions. Additionally, we show reactions and orders for compartmentalization and metabolism.
3. Results
Consider some initial population of
I random sequences
. The population over time is given by
with associated random counting measure
on
. Recall parameters
for sequence dimension
n, high-fidelity sequence set size
q, fitness degree
k, similarity degree
l, fidelity degree
m, clay fidelity probability
p, RNA decay rate
, double-strand dissociation rate
, double-strand formation rate
, RNA replication rate
, and clay oligomerization and polymerization rates
and
. These parameters are summarized in
Table 3.
The following is the description of how the parameter values were specified and to what they biologically correspond. The sequence dimension n is chosen from . The fitness and similarity functions are chosen by setting the value of the range of the curvature parameters k and l from one (inside the high-fidelity manifold) to some small values, such as over an exponential grid. For example, when , the fitness of sequences that are maximally dissimilar have 10% of the fitness of the high-fidelity sequences. We range the grid from to for fitness and similarity. The RNA fidelity parameter is chosen such that the high-fidelity sequences have value one and the lowest fidelity sequences have value 0.25, equal to random chance. The clay fidelity parameter is set to an optimistically high value of for clay studies. The double-strand dissociation and formation rates and are set to unity as a baseline. In comparison, the RNA replication rate is set to a large value, 10, whereby replication is the dominant reaction. The decay parameter is set to some uniform random value in (0, 1). The clay RNA oligomerization rate is set to unit, and ‘clay’ RNA polymerization rate is set to a uniform random value in (0, 20).
With the parameters governing the reaction rates, different values of these parameters confer different regimes for the system.
3.1. Stability: ODEs
We characterize the zeros of the vector field
f from ODE system (
A1) and use the eigenvalues of the Jacobian (
A2) to determine their stability.
Theorem 2. The ODE system (
A1)
for , , has a single unstable fixed-point at and for . Proof. Solving gives a single solution and for . For this solution, the eigenvalues of the Jacobian contain no zero values and positive values. Therefore, the solution is unstable. □
It follows from Theorem 2 that, for all other initial conditions, the system has no equilibria.
Corollary 2 (Unbounded). For all initial conditions such that , the system is unbounded.
This confirms the obvious: the system, a replicating network with no death, is almost always an increasing system.
3.2. Simulation Reaction State
We are interested in the behavior of temporal probability measures,
(
25) on the sequence product space and
(
27) on the sequence space, for RNA polymerization. These reveal the instantaneous information of the system. The structure of
reveals the state of polymerization and is a leading indicator of the population concentrations over time.
3.2.1. Core Model with “Tent” Functions, Probable Hitting
Take sequence dimension
and fitness and similarity curvature parameters
and fidelity parameter
. Set rates for double-strand dissociation and formation
and polymerization rate
and use the “tent” function for fitness, similarity, and fidelity. Take random initial population
with initial population size
and random singleton
(
). We simulate 5000 reactions, simulation censored at hitting time
for volume fraction
. Take partition of the sequence space by Hamming distance to the high-fidelity manifold
(
28) of sequence space
E. In
Figure 1, we plot measures of a typical realization of the system
on the partition
of sequence concentration (
Figure 1a), growth (
Figure 1b), and polymerase sequence output
(
Figure 1c). Some quantities are plotted on log-log scale, whereas others are plotted on a linear-log scale. These results show that the concentrations are relatively stable for most time, until the high-fidelity manifold is hit. Then, the concentration of high-fidelity replicators rapidly increases to exceed 25%. Similarly,
Figure 1b shows the growth curves on a log-log scale, where the high-fidelity manifold rapidly increases near the end of the simulation.
Figure 1c shows the structure of the RNA sequence polymerization output temporal probability measure
. Low probability is assigned to polymerization of high-fidelity replicators for most of the reaction time, followed by a large increase near the end of the simulation, where high-fidelity replicators dominate with 56% probability.
Therefore, the RNA sequence polymerization output temporal probability measure is a leading indicator of the concentration curve, i.e., at simulation end-time, concentration of high-fidelity replicators is 25% and polymerization output is 56%. 3.2.2. Core Model with “Tent” Functions, Improbable Hitting
We use the same configuration as
Section 3.2.1 except for setting fitness and similarity curvature parameters to
. In
Figure 2, we plot measures of a typical realization of the system
on sequence partition by Hamming distance to the high-fidelity manifold
of concentration (
Figure 2a), growth (
Figure 2b), and
(
Figure 2c). The behavior has completely changed: the high-fidelity group ends the simulation with around 6% concentration, only steadily increasing, and never hits. The polymerase output
shows 6%. This indicates that the concentration of high-fidelity replicators is unlikely to increase further, as the population is generally in equilibrium with the polymerase output.
3.2.3. Core Model with Linear Functions, Improbable Hitting
The same configurations for
Section 3.2.1 are used, except the fitness, similarity, and fidelity functions are linear. Similar to the “tent” functions, we specify the terminus landscape curvature for fitness and similarity
. Then, the fitness function for RNA polymerization is given by
and
We put fitness and similarity landscape curvature
for fitness and similarity and
for hitting volume fraction of high-fidelity replicators. We simulate
for 5000 reactions. We find that the probability of hitting is near zero
. In
Figure A1, we plot measures of a typical realization of
on
of concentration (
Figure A1a), growth (
Figure A1b), and
(
Figure A1c). The simulation ends with high-fidelity concentration of ∼5% and polymerase output of ∼4%. Therefore, the concentration of high-fidelity replicators will continue to decrease.
Linear surfaces are not sufficient to achieve hitting times for high-fidelity replicator volume fraction , in contrast to the nonlinear “tent” functions. 3.2.4. Expanded Model with “Tent” Functions, Probable Hitting
We consider a similar model to the previous subsections and expand it with clay oligomerization rate (of RNA)
, clay polymerization rate (of RNA)
, and clay polymerization fidelity
p. Therefore, the full set of variables is given by
. The value of fitness/similarity landscape curvature
and clay RNA polymerization raet
are set such that the replicative mass of each is initialized to 10. This means that RNA and clay polymerization have the same reaction mass at the beginning of the simulation. We set sequence dimension
, fitness/similarity landscape curvature
, clay RNA polymerization fidelity
, and double-strand dissociation and formation rates
. This is a high hitting regime, i.e., the probability of hitting is close to one
. In
Figure A2, we plot measures of a typical realization of
on
and additionally the probability of reactions over time. High-fidelity replicators ended the simulation with 25% concentration (
Figure A2a) and RNA polymerase output ∼62% (
Figure A2c), indicating that the concentration of high-fidelity replicators will continue to increase. All species exhibit superexponential growth (
Figure A2b). Clay polymerization decreases in contribution over time, whereas RNA polymerization increases substantially over time, and RNA double-strand reactions are small and stable (
Figure A2d).
3.3. Hitting Times: Functional and Survival Analysis
We study various models in order of increasing complexity. We examine the hitting time surface in the parameters , including probability of hitting . We begin with the core model with no decay or clay.
3.3.1. Core Model, for with and “Tent” Functions
We are interested in the structure of the hitting time
of (
19) as a function of the parameter vector
. We use the Weibull–Cox proportional hazard’s model of
Table 1 for the hitting time
for volume fraction
. Let
with sequence dimension
and fitness and similarity parameters
for
. Set
for fidelity probability parameters. For each value of sequence dimension
n, take random initial population
with initial population size
and random singleton
for the high-fidelity manifold and fix these for the fitness/similarity landscape curvature parmeters
. We fix the double-strand dissociation and formation rate parameters
and set RNA polymerization rate
such that the overall RNA polymerization rate is given by
and use the “tent” function for fitness, similarity, and fidelity. We take 10 realizations of hitting time
for each parameter vector
and allocate 5000 reactions. This gives 100 independent hitting times and up to 500,000 reactions. The times are comparable because the system is initialized to the same replication mass.
For the simulations, 66 hitting times are finite. The coefficients positively contribute to hitting, where
and
, both with
p-values less than 0.005. Therefore, hittings are strongly positively influenced by the parameters of the fitness and similarity functions and less so by the dimension. Plots of the coefficients and survival and cumulative hazard curves are given in
Figure A3.
High survival is found for k large and low survival for k small. Cumulative hazard is highest for k small.We estimate HDMR of the classifier (whether or not hitting time is finite) using all 100 samples. The results are shown in
Figure A5 and
Table 4. The HDMR explains roughly 80% of variance.
and
, so
hitting probability is strongly influenced by fitness landscape curvature k and less so by sequence length n. The component functions
and
for fitness landscape curvature and sequence dimension are strictly decreasing, where larger fitness landscape curvature parameter
k results in decreasing hitting probability. These results are consistent with the survival analysis.
We estimate HDMR of the regressor (hitting time) using the 66 simulations with finite hitting time. The results are shown in
Figure A4 and
Table 5. The HDMR explains roughly 60% of variance.
and
, so
sequence dimension dominates the hitting time. Both HDMR component functions
and
for sequence dimension and fitness landscape curvature are increasing. The HDMR results reveal that conditioning on hitting reverses the roles of sequence dimension
n and fitness landscape curvature
k.
3.3.2. Clay and Decay Model, for with and “Tent” Functions
We expand the model to include clay and decay. We take parameter vector
with double-strand dissociation and formation and clay oligomerization reaction rates
. For each parameter vector
, (i) we choose singleton high-fidelity replicator manifold
for some RNA sequence
and choose random initial population of RNA molecules
such that the initial population size is 10,
, and where the initial population does not intersect the high-fidelity manifold
, that is, the initial population does not reside on the high-fidelity manifold; (ii) we initialize the replicative mass of the system such that the initial overall RNA polymerization reaction rate is given by
and the initial overall clay RNA polymerization reaction rate
; (iii) we sample the hitting times
for volume fraction of high-fidelity replicators
a total of
times, each censored by 5000 reactions, giving hitting time set
of (
20). We attain input–output data set as
. This gives a total of 2400 simulations.
For the simulations, 1546 hitting times are finite. The results of fitting the Weibull–Cox model are shown below in
Table 6 and
Figure A6. The curvature parameter
k again significantly dominates with a large positive value. All the remaining parameters have values less than one. Sequence dimension
n again is a relatively small positive contributor. The clay fraction
is small and positive and replication fidelity parameter
p is not significant.
We estimate HDMR of the classifier (whether or not hitting time is finite) using all 2400 samples. Component functions and sensitivity indices are shown below in
Figure A7 and
Table 7. First-order HDMR captures 74% of explained variance, and second-order captures 4%.
Curvature dominates hitting probability with large sensitivity index . The HDMR component function in landscape curvature
is a decreasing function, where small values increase and large values decrease hitting probability. Sequence dimension
n has sensitivity index
, and the HDMR component function
is decreasing, where high dimension decreases the probability of hitting. Clay parameter sensitivity index is small
, and the HDMR component function for fractional clay RNA polymerization rate,
, is decreasing, where low-to-medium clay fractions increase and high-clay fractions decrease probability of hitting. The HDMR results are consistent with the Weibull–Cox model.
We estimate HDMR of the regressor (hitting time) using the 1546 simulations with finite hitting time. Component functions and sensitivity indices are shown below in
Figure A8 and
Table 8. First-order HDMR captures 33% of explained variance, and second-order captures 7%.
In stark contrast to the contributions to the classifier, the parameters k and n are insignificant to hitting time. Instead, the largest sensitivity index is . The HDMR component function for fractional clay RNA polymerization rate,
, is an increasing function, where small
decreases and large
increases the hitting time. This suggests that high clay-fractions representing first-order reactions increase the hitting time, as clay polymerization has less replicative mass than RNA polymerization, i.e., things go faster with RNA polymerization. The second largest sensitivity index is decay
. Decay is an increasing function, with sharp increase in hitting times nearby one, i.e., things go slower with large decay resulting in increased hitting time.
3.4. Compartmentalization
Compartmentalization has a direct effect on the calculation of the reaction rates, specifically replication, by computing only a subset of the reactions in
. Put
For vesicle region
, we have that
and total replicative mass
As
increases in size over time, the number of partitions grows, and
Therefore, the replicative mass will be reduced with , and the system evolves less quickly. This suggests that there is a trade-off between the degree of compartmentalization and the replicative mass of the system.
4. Discussion and Conclusions
Origins of life is a fascinating problem. The wonderful complexity of extant life follows from origins. The distribution of life in the universe is tied to origins.
In this article, we have attempted to peek into the problem by concentrating on the RNA world hypothesis, studying hitting times of high-fidelity replicators. We develop fitness, similarity, and fidelity functions as landscapes for a mathematical model of replicating RNA molecules at the sequence level and observe hitting times through simulation studies. We draw attention to the distinction between the probability of hitting and the hitting time .
In terms of mathematical set-up, we interpret the reactions as measure-kernel-functions. Each reaction is identified to and fully encoded by a probability transition kernel. The reactions take place in some domain, whereby all molecules may interact. We note that, in reality, molecules have limited diffusion, and this effectively breaks the reaction domain into independent subdomains above some length scale, i.e., molecules are more likely to react with their neighbors. Therefore, we assume our reaction volume is sufficiently small such that all molecules may participate in the reactions. We use for modeling purposes the ansatz that sequence distance is correlated to spatial proximity, where similar sequences are proximal, using a non-trivial similarity function .
Theorem 2 and its Corollary 2 show that the system (without decay) is unbounded and strictly increases. This formally shows the system to be a growth process. Next, we illustrate findings about the core system with probable hitting (
Section 3.2.1). In particular, we see that the temporal image measure
, which describes the polymerization output, is a leading indicator of high-fidelity sequence concentration. Polymerization output and high-fidelity replicators super-exponentially increase near the end of the simulation. Next, the fitness and sequence curvature parameters are set at a higher value which confers reduced fidelity (
Section 3.2.2). This reveals that hitting is never achieved and that the polymerization output is in equilibrium with population composition. Hence, the probability of hitting is strongly influenced by landscape curvatures. Next, linear curvature is utilized for fitness and similarity and results in no hitting (
Section 3.2.3). This reveals that nonlinear curvature is necessary to achieve hitting of high-fidelity replicators. Next, we expand the model with non-RNA (‘clay’) based polymerization and find that such activity decreases over time, in contrast to RNA polymerization, which greatly increases and dominates other reactions over time the system (
Section 3.2.4).
For functional and survival analysis of the hitting times, we study the core model, whereby hitting times are strongly positively influenced by the fitness and similar functions yet are not impacted significantly by sequence dimension (
Section 3.3.1). In particular, survival analysis reveals low fitness curvature confers low survival (high hitting), whereas high fitness curvature confers high survival (low hitting). HDMR analysis shows that hitting probability is strongly influenced by fitness curvature and much less so by sequence dimension, supporting the survival analysis. HDMR analysis of hitting time shows reversed roles for sequence dimension and landscape curvature, where sequence dimension dominates hitting time, with curvature playing a far less significant role. This gives the finding that hitting probability is driven by curvature, whereas hitting time is driven by sequence dimension. Next, we perform functional and survival analysis of the core model augmented with ‘clay and decay’ dynamics (
Section 3.3.2). Survival analysis shows similar results to the core model, where curvature dominates survival (no hitting), with sequence dimension playing a significantly reduced role. HDMR analysis of hitting probability shows that curvature dominates hitting probability, similar to the core model, whereas sequence dimension again plays a significantly reduced role. HDMR analysis of hitting time reveals that the presence of ‘clay and decay’ significantly increase hitting time, with curvature and sequence dimension playing insignificant roles. These results are consistent in that clay polymerization has less replicative mass than RNA polymerization, where RNA polymerization is a faster reaction.
Overall, we find that nonlinear landscapes are necessary for hitting: linear landscapes are insufficient. For nonlinear landscapes, we find that the probability of hitting is dominated by curvature and that hitting times are dominated by sequence dimension. These results suggest that the landscapes in nature are nonlinear with high curvature, and that the hitting time for high-fidelity replicators is an increasing function of sequence dimension. When clay and decay are added to the model, hitting probability is again dominated by curvature, and clay and decay are relatively insignificant. This reflects that clay and decay are low order reactions. They increase hitting times.
For replication and compartmentalization, we suggest that compartmentalization, while a necessary condition, slows overall system dynamics with increasing vesiculation rate. Essentially, as compartmentalization increases, there is a corresponding reduction in absolute replicative system mass, as certain reactions among elements are no longer possible, being physically sequestered. While the timescale of a simulation is tied to the replicative system mass of the system, there is variability in replicative mass across compartments. Some compartments contain large genomic and metabolic populations. It favors the search for the high-fidelity replicator by there being a distribution on compartmental ‘fitness’ such as resource concentrations so that the high-fitness compartments drive replicative system mass. Compartmentalization is identified to the measure on coordinates in , for which coordinates are “marked” by sequences through the transition probability kernel , and followed by genomic adaptation via the transition probability kernel .
Metabolism is thought to be identified to production of precursors to RNA synthesis, leading to replication identification, followed by genomic adaptation to metabolic state. Metabolism is thus defined through the transition kernels and .
The independence of the transition kernels can be scrutinized, and it is possible that general transition kernels on the full product spaces across location, genomic, and metabolic states are necessary to satisfactorily explain RNA origins, i.e., all three functions may have co-evolved. This notion is suggested in the hot springs hypothesis for origins, where compartmentalization is hypothesized to furnish necessary conditions to genomic and metabolic state [
26,
27,
28]. In this telling, the base measurable space of interest is
with measure
. Then, identification of a high-fidelity replicator is described through the transition probability kernel
from
into
in “marking” the base measurable space with state for genomic replication; finally, genomic adaptation is conveyed through the kernel
from
into
. Hence, RNA origins of life has law
on the product space
, reflecting the steps of replicator identification and adaptation through the definitions transition kernels
and
.
A putative “genesis machine” here is a mapping from the base (initial) measurable space
into the product space of identified high-fidelity replicators undergoing adaptation, i.e.,
. More generally the base space could additionally contain amino acid sequence space
. Such a machine is fully specified through the definitions of the distribution
on the base space and the transition kernels
and
(pre and post genes, respectively). Because the stages of transition occur purely through random drift, an experiment performed by such a machine would take an unacceptably long period of time to complete. Experimental demonstration can be contemplated by augmenting the base measurable space with a control space
to accelerate dynamics, using for instance closed-loop shaped radiation to address molecular degrees of freedom in their appropriate frequency domains, (open-loop) catalysts, temperature, geometry, selection, concentration through centrifugal force, etc., resulting in the new (four-dimensional) base space
. Then, the transition kernels
and
become mappings from
into
and from
into
, respectively. The general design of a genesis machine is the definitions of the augmented base space
, its distribution
, and the augmented transition kernels
and
, giving law
on the full 15-dimensional product space
, written in differential notation
Origins could be experimentally demonstrated using a sequence of adaptive control fields in , cycling through the transitions, and a detection system for online identification of the system whose elements belong to the product measurable space. The full design and estimated operating timescale for such a machine needs further research to assess practical feasibility. We call the creation of primordial life (primordia) by the continuous causal efforts of a genesis machine given initial prebiotic conditions artebiogenesis, where arte- is Latin and means “from skill.” The primordia are not necessarily those that occurred in nature. Primordia and their genesis represent non-trivial system trajectories across the transition to the earliest life in sterile environments and belong to a manifold of primordial lifeforms, each having characteristic geochemical setting.
The probability measure
has additional utility to integrate ‘test’ functions or queries about the system. If we let
be a fitness function, then the fitness value
is the expected value of the fitness function with respect to the probability measure
. In OptiEvo theory,
is studied as a function of the population
on
[
52]. OptiEvo assumes that the set of all probability measures
is convex and that
has sufficient flexibility such that
may be explored around
. Then, OptiEvo predicts that
has global maxima on
and that these form a connected level-set of sequences with the same fitness value. Both predictions are consistent with our model. The first prediction is consistent with zero distance in fitness and similarity functions for high-fidelity sequences. The second prediction is consistent with the high-fidelity set being a singleton or a product-space construction. A contention is whether
has sufficient flexibility in exploring
around
. This has direct bearing on the structure of
Q: if
is inflexible, then
Q is constrained to certain subspaces of
, i.e., not all transitions are possible.
In future work, the model could be extended to the space of sequences of lengths up to n, or even the space of sequences of all lengths, where distance and similarity functions would utilize a more general string distance metric, e.g., Levenshtein distance. We note that the size of is not much larger than . Alternative similarity functions could be explored, such as the trivial case of constant similarity, e.g., for . A limitation of this article is the restriction to short sequences due to computational efficiency. The numerical size of the sequences is mathematically low-dimensional and does not correspond to actual functions of RNA molecules. Other parameter sets can be explored for example using experimentally derived values for reaction rates, so that the timescales are calibrated. Future research could see the simulation software rewritten for a high-performance computing environment, enabling much longer, e.g., length 10–1,000, sequences to be studied. Polymerase fitness can be made empirical using known RdRp sequences as members of the high-fidelity manifold. Another area of future work could be studying the aforementioned transition kernels , , , and . More general models for polymerization transition kernel based on the structure of the Poisson-binomial distribution could be employed. It would be interesting to study lipid-RNA and metabolism-RNA interactions and equip the system with the ability to append nucleotides to their sequences to form functional genes, such as storing useful information for the replication channel, perhaps a Hammerhead ribozyme to convey rolling-circle amplification. We note that transition kernels here generally lack amino acid state and are pure-RNA. An area to explore is the notion of the transition kernel into the space of high-fidelity replicators to depend on amino acid sequences and then to elaborate the system to contain a primitive translation system and examine various hitting times. Additional reactions can be introduced as operations on pairs of sequences, such as concatenation, and others for sequence splitting, and so on, with corresponding transition kernels enabling RNA networking and recombination dynamics.