When designing primers for multi-point mutagenic primers, the consideration of primer constraints is an important key to the success or failure of the experiment. In PCR experiments, people (operators), things (research projects), time (operation time), places (location and environment) and objects (materials used) are often the reasons for different experimental results.
2.2. Functionalization of Multi-Point Mutagenic Primers and Design of API Library for Computational Intelligence Methods
In order to provide future research and development of multi-point mutagenic primer design algorithms, and to facilitate the continuous use of these improved multi-point mutagenic primer design conditions, this study functionalized these improved multi-point mutagenic primer design conditions and planned them in the API (Application Programming Interface) library to provide relevant computational intelligence methods and reuse by researchers in the future. The API function library of these related definitions includes primer length function, length difference function between forward and reverse primer, ratio function of primer ‘G’ and ‘C’ nucleotides, primer 3′ end function, melting temperature calculation function, melting temperature difference function of forward and reverse primer to template sequence, repeat calculation function of the same nucleotide in primer, primer dimer evaluation function, primer hairpin evaluation function, specificity evaluation function, product length calculation function, etc. In defining these functions, first, the solution of the mutagenic primer design problem is represented by
s. The overall template sequence is mainly based on the nucleotide base composition of ‘A’, ‘T’, ‘C’ and ‘G,’ and considers the inclusion of multiple SNPs. It is defined as formula (1):
where
represents the template sequence;
B is the nucleotide base;
i represents the position of the nucleotide base.
The forward primer
, i.e., the mutagenic primer, and the reverse primer
are defined as the following Formulas (2) and (3), respectively.
where
represents the forward primer;
represents the reverse primer;
and
represent the starting position and end position of the forward primer, respectively;
and
represent the starting position and end position of the reverse primer. Here, it should be noted that the forward primer mainly contains the subsequence of the SNP’s base at the 3′ end, and the reverse primer does not contain the subsequence of the SNP’s base.
The following defines the conditional functions for the design of mutagenic primers:
- (1)
Primer length function
The primer length will cause the melting temperature between the primer and the template sequence to increase or decrease, and it will also indirectly affect the generation of secondary structure and the specificity of the primer. Therefore, this study functionalizes the primer length function, which is expressed as follows:
where the
represents the length of the forward and reverse primers; the
represents the minimum primer length designed; the
represents the maximum primer length designed.
- (2)
Primer length difference function
The designed lengths of the forward and reverse primers are not necessarily the same. Many studies have found that a length difference between both of the primers of less than 3 nt is the most ideal state, but due to some special requirements, the length difference must be made within a certain length. In order to allow the function to flexibly adjust the length difference, the length difference is represented by
n here, and the function is expressed as follows:
where the
represents the length difference between primers; the
represents the forward primer length; the
represents the reverse primer length;
n represents the length difference.
- (3)
The ratio function of primer ‘G’ and ‘C’ nucleotides
The ratio of ‘G/C’ nucleotides in primers is generally found in the literature to be between 40% and 60% for optimal adhesion. In order to allow users to adjust according to their needs, we provide the ratio function of primer ‘G’ and ‘C’ nucleotides as follows:
where the
is the minimum G/C ratio value; the
is the maximum G/C ratio value; the
PGC function represents the proportion of ‘G/C’ nucleotides in the forward and reverse primers.
- (4)
The 3′ end function
Usually the 3′ end of a primer is designed with ‘G’ or ‘C’ nucleotides, because ‘G’ or ‘C’ nucleotides have a stronger bond than ‘A’ or ‘T’ nucleotides in DNA structure. Considering the nucleotides at the primer 3′ end, the function is expressed as follows:
where the
function represents the nucleotides at the 3′ end of the forward and reverse primers.
- (5)
Melting temperature calculation function
The melting temperature (
Tm) is the temperature at which the primer separates from the template. In this study, the thermodynamics theory proposed by SantaLucia [
13] is functionalized, which is also the most accurate melting temperature equation identified by researchers, as follows:
where
is enthalpy;
is entropy correction;
R is gas constant (1.987 cal/Kmol);
C is DNA concentration.
- (6)
The melting temperature difference function
The difference of Tm between the forward and reverse primers and the template sequence is generally considered to be no more than 55 °C; otherwise, the primer will not be properly attached to the template sequence. In order to allow users to set flexibly, the function of the Tm difference between both of the primers and the template sequence is expressed as follows:
where the
function represents the Tm difference between the forward and reverse primers and the template sequence; the
represents the Tm difference between the forward primer and the template sequence; the
represents the Tm difference between the reverse primer and the template sequence;
n represents the maximum difference of Tm.
- (7)
Repeat calculation function of identical nucleotides in primers
When a specific nucleotide within a forward or reverse primer is repeated multiple times, such as ATATATAT, this situation may result in errors in annealing to the template sequence. Therefore, the repetition calculation function for the same nucleotide within a primer is expressed as follows:
where the
function is the evaluation function of repeated nucleotides in the primer;
r represents the maximum allowable number of repetitions.
- (8)
Primer dimer assessment function
When both of the forward and reverse primers are in a dimer structure due to the composition of nucleotides, it can easily cause the primer to fail to bind to the template sequence. Here, the primer dimer evaluation function is expressed as follows:
where the
function is to evaluate the dimer of the forward and reverse primers; the cross dimer is the mutual adhesion of the forward and reverse primers; the self dimer is the mutual adhesion of the forward and forward primers or the reverse and reverse primers.
- (9)
Hairpin evaluation function
The main reason for the formation of hairpin bends is that the nucleotide structure of the forward primer or the nucleotide structure of the reverse primer are polymerized before and after each other, which easily leads to the failure of the primer to adhere to the template sequence and, thus, the failure of the PCR experiment. The evaluation function for the formation of primer hairpin bends is expressed as follows:
where the
Phairpin function is to evaluate the hairpin bending situation of the forward primer and the reverse primer itself.
- (10)
Specificity assessment function
Specificity is a very important property in primer design. Primers with specificity can bind to specific template sequence positions in order to generate desired products. Therefore, this study includes the introduction specificity evaluation function, which is expressed as follows:
where the
function evaluates that the forward and reverse primers must be able to bind to a specific template sequence position.
- (11)
Product length function
The length of the product is the key to identifying the success or failure of a PCR experiment. Usually, the length of the product that can be recognized at present is about 150 bp. However, in order to provide researchers with flexible settings, the product length calculation function is expressed as follows:
where the
function represents the PCR product length formed by the forward and reverse primers;
n represents the size of the product.
2.3. Developing a Search Strategy for Primers Suitable for Multi-Point Mutagenic, Importing High-Throughput Computing and Improving Computing Efficiency
This research develops a search strategy suitable for multi-point mutagenic primers and integrates it into the TLBO method in order to improve the quality of multi-point mutagenic primer design. At the same time, high-throughput computing processing is introduced in order to improve computational efficiency of multi-point mutagenic primer design. Teaching–learning-based optimization, an algorithm inspired by the teaching and learning process, is proposed by Rao et al. [
7,
8,
14]. This method is mainly based on the teacher’s role in influencing learners’ effectiveness in the classroom and the concept of learners’ mutual self-learning. There are two main basic learning modes: (1) teaching through teachers (teaching phase) and (2) interacting with other learners (learning phase). In the solution process of this method, the subjects studied by each learner are regarded as different design variables of the optimization problem, and the learning outcomes of the learners are similar to fitness values of the optimization problem. In order to make this method more effective in the design of multi-point mutagenic primers, this study develops a search strategy suitable for multi-point mutagenic primers to help this method more effectively escape the shortcomings of the genetic algorithm falling into the optimal solution, and it can also effectively improve the quality of multi-point mutagenic primer design.
- (1)
Multi-point mutagenic primer design
Figure 1 is a conceptual diagram of multi-point mutagenic primer design. Firstly, according to the multi-point mutagenic primer design, the solution should be solved by the teaching–learning-based optimization method. In this study,
M,
Fl,
Pl,
Rl,
Fm and Σ are used as learner codes to design multi-point mutagenic primers. The symbol
M represents the mutagenic nucleotide number;
Fl represents the mutagenic primer length;
Pl represents the product size;
Rl represents the reverse primer length;
Fm represents the mutagenic nucleotide position, and its position index can be from 1 to max-M; max-M represents the maximum number of mutagenic nucleotides; Σ represents an integer value of 0, 1, 2 or 3, and its individual meaning represents the mutagenic nucleotides ‘A’, ‘T’, ‘C’ or ‘G’. The learner is coded as follows:
In order to improve the design efficiency of multi-point mutagenic primers, we developed REHUNT (Restriction Enzymes HUNTing) API [
11], using JAVA to carry out complete and specific restriction enzymes mining work.
In addition, we also applied the technology of the single-point mutagenic matrix [
6,
10] to propose a multi-point mutagenic matrix to completely record the restriction enzymes mining results of the target SNP. The single-point mutagenic matrix contains a total of (
Fm_max − Fm_min + 1) × 4 elements, while the multi-point mutagenic matrix contains
Fm max-M single-point mutagenic matrices, as shown in
Figure 2. The row indicates the mutagenic position, which is between (SNP—
Fm_max + 1) and (SNP—
Fm_min + 1). The column represents the four bases of the single-point mutagenic nucleotides ‘A’, ‘T’, ‘C’ and ‘G’. In the single-point mutagenic matrix, the internal element values are 0, 1, 2 or 3, respectively, and their meanings are expressed as follows:
- 0:
Restriction enzymes available on both strands that can distinguish the target SNP when the single-point mutagenic nucleotide is determined.
- 1:
Restriction enzymes available only on one of the double strands that can distinguish the target SNP when the single-point mutagenic nucleotide is determined.
- 2:
No restriction enzyme available on either strand that can distinguish the target SNP when the single-point mutagenic nucleotide is determined.
- 3:
No single-point mutagenic nucleotide, and the original nucleotide is still maintained.
- (2)
Search strategies
In terms of search strategies, local search, elite search and interactive learning are added to the TLBO method [
14]. The purpose of adding the local search strategy is to find better learners near the original learners. The knowledge of the original learner is improved by learning the knowledge of the better nearby learner, so the local optimal solution centered on the learner can be obtained. In order Tt have a better solution for the learners when first using the TLBO method, a local search of adjacent learners is first performed for the learners in the initial learning group. Afterwards, local searches are also performed after subsequent learner-interactive learning phases. In each iteration of the local search process, all children of the learners will participate in the local search; thus, the local optimal solution of each learner is continuously retained, and finally, the global optimal solution is obtained. The pseudo code (Algorithm 1) of the learner’s local search strategy used in this study is shown below. In this pseudo code,
S represents a group, and
d is used to assist learners to search for their neighbors. Before calculating
d, the constant
a must be determined for the variable values of the different problems. In this study, the constant
a of the local search will be determined by the length difference between the maximum and minimum primer length in order to obtain a better range.
The pseudo code of the local search:
Algorithm 1 Local search [3] |
1 | Begin; |
2 | Select an incremental value d = a × Rand(); |
3 | For a given learner i ∈ S: calculate achievement(i); |
4 | For j = 1 to number of variables in learner i |
5 | value(j) = value(j) + d; |
6 | If achievement(i) in learner i is not improved then |
7 | value(j) = value(j) − d; |
8 | Else if achievement(i) in learner i is improved then |
9 | Retain value(j); |
10 | Next j; |
11 | Next i; |
12 | End; |
Where d is the search range; a is a variable that is dependent on the problem; S represents the entire learning group; achievement is the learning outcome.
In addition, the purpose of adding the elite search is to provide a higher probability of being a teacher to teach the rest of the learners through several excellent learners, and to use n elite learners as auxiliary teachers at the same time in order to promote convergence of learning process and achieve better learning outcomes. Therefore, before and during the iteration of the TLBO method, the elite learners in the learning group will be given priority to serve as teachers with a higher probability, so as to obtain better results in the learning process. The pseudo code (Algorithm 2) for the elite search used in this study is shown below.
The pseudo code of elite search:
Algorithm 2 Elite search |
1 | Begin; |
2 | For i = 1 to number of learners |
3 | Initialize teach_rate(i) = 0; |
4 | For j = i + 1 to number of learners |
5 | If achievement(i) < achievement(j) then |
6 | teach_rate(i) = teach_rate(i) + (1/number of learners); |
7 | Next j; |
8 | Next i; |
9 | Find n better learner according to teach_rate(i) from learners; |
10 | Set n better learner as the assistant teachers; |
11 | End; |
Where teach_rate is the probability of being selected as a teacher; n is the number of auxiliary teachers.
Finally, the interactive learning helps learners to interact with others in the learning group. If the learners who want to interact have more knowledge than they themselves do, the learners will acquire new knowledge by learning with them. This strategy takes into account real-life peer learning and is ideal for TLBO methods. By interacting with learners in other overall groups, learners can avoid being limited to only acquiring regional knowledge, and can acquire global knowledge centered on this learner. The pseudo code (Algorithm 3) proposed in this study to conduct an interactive search is shown as follows:
The pseudo code of the interactive search:
Algorithm 3 Interactive learning |
1 | Begin; |
2 | For a random learner P ∈ S: calculate achievement(P); |
3 | For j = 1 to number of variables in learner i |
4 | Randomly select a learner Q ∈ S and Q ≠ P; |
5 | Temp = value(P, j); |
6 | If achievement(P) < achievement(Q) |
7 | value(P, j) = value(P, j) + ri (value(P, j) − value(Q, j)); |
8 | Else |
9 | value(P, j) = value(P, j) + ri (value(Q, j) − value(P, j)); |
10 | If achievement(P) in learner P is not improved then |
11 | value(P, j) = temp; |
12 | Else if achievement(P) in learner i is improved then |
13 | Retain value(P, j); |
14 | Next j; |
15 | End; |
Where k represents the learners who want to interact; S represents the whole learning group.
- (3)
High-throughput computing
The steps of high-throughput computing are described as follows:
Step 1. Learner code for mutagenic primer solution.
First, learner coding is performed for the problem of mutagenic primer design. Each learner represents the solution of a mutagenic primer, and each variable in the learner represents the subject it learns.
Step 2. Import the high-throughput target SNP template sequence.
Next, the template sequences of the high-throughput target SNPs must be imported to be used as primers for designing high-throughput multi-point mutagenic primers. The template sequence of the imported high-throughput target SNP is recommended to be at least 250 bps adjacent to the target SNP, and the full length must be at least 501 bps to provide sufficient sequence for the selection of optimized primer fragments.
Step 3. High-throughput computational analysis processing.
In order to allow subsequent algorithms to process the template sequences of high-throughput target SNPs, this study uses a distributed programming framework to design and build a computing cluster system for analysis and processing.
Step 4. Calculate the multi-point mutagenic matrix.
First, the developed REHUNT of restriction enzyme mining API is used to search for target SNP restriction enzymes in order to generate a multi-point mutagenic matrix, so that the mutagenic restriction enzyme information can be efficiently and directly accessed during the iteration of the algorithm.
Step 5. Determine the presence or absence of restriction enzymes.
After judging whether the target SNP is subjected to multi-point mutagenic design, the available restriction enzyme is determined for identification of the target SNP position.
Step 6. Initialize the learning population.
If the target SNP has available restriction enzymes, tens to hundreds of s learners are randomly generated as the initial learning population. The value of Fm is randomly generated between (SNP—Fm_max + 1) and (SNP—Fm_min + 1). The value of Fl is randomly generated between the minimum and maximum primer lengths, according to common primer length constraints. In addition, in order to limit the PCR amplicon size, the value of Pl is randomly generated between the minimum and maximum product length. The value of Rl is generated in the same way as for Fl. The value of Σ is randomly generated between 0, 1, 2 or 3, and its individual meaning represents the mutagenic nucleotide ‘A’, ‘T’, ‘C’ or ‘G’.
Step 7. Do a local search.
In order to enable the learners at the beginning to have good quality, the initial local search will be carried out for the learners in the initial learning group. Afterwards, a local search will also be performed after subsequent learner-interactive learning phases. Through the iterative process, all children’s learners will participate in the local search, so that the local optimal solution of each learner is continuously retained, and thus, the global optimal solution is determined.
Step 8. Perform an elite search.
Before and during the iteration of the teach-learning optimization method, the algorithm will seek out elite learners in the learning population and provide a high probability of becoming teachers to teach the rest of the learners. With elite learners as auxiliary teachers, the convergence of the learning process will be promoted and better learning outcomes will be achieved.
Step 9. Teach Phase.
During this phase, teachers try to increase the average learning outcomes of learners in the subject courses they teach according to their abilities. At any iteration
i, assume there are
m number of subjects (i.e., design variables),
n number of learners (i.e., learning population size,
k = 1, 2,…,
n) and
Mj,i for learners with an average learning outcome for a particular subject
j (
j = 1, 2,…,
m). The best overall learning outcome
Xtotal-kbest,i considers the best learning outcomes of the entire learner group in all subjects, and takes
kbest as the best learner. Under normal circumstances, a teacher is usually considered a person with a high learning ability to train learners for better learning outcomes. The differences between the existing average learning outcomes for each subject and the corresponding learning outcomes for teachers in each subject are given as follows:
where
Xj,kbest,i are the learning results of the best learner (i.e., teacher) in subject
j;
TF is the teaching factor, which determines the change in the mean;
ri is a random number in the interval [0, 1]. The
TF value can be 1 or 2, randomly determined with equal probability as follows:
Here, the TF value is not used as an input to the algorithm, and its value is randomly determined by Equation (17). It is pointed out in the literature that the TLBO algorithm is superior in performing many benchmark function simulation experiments with TF values between 1 and 2. However, in the simulation experiment, it was found that a TF value of 1 or 2 is more suitable for solving the algorithm. Therefore, in order to simplify the algorithm, it is recommended to use 1 or 2 for the teaching factor.
Step 10. Update Learning Outcomes.
Based on
Difference_Meanj,k,i, the existing solution is updated in the teach phase according to the following formula:
where
X’j,k,i is the updated value of
Xj,k,i. The algorithm accepts
X’j,k,i if it provides a better function value. All accepted function values are retained at the end of the teaching phase, and these values become the learner’s input to the interactive learning phase, which depends on the teaching phase.
Step 11. Interactive learning phase.
After a certain period of iteration, the learner can randomly interact with the learners in the learning group in order to improve the learner’s own knowledge and escape from the optimal solution in the region. We express the learning phenomenon in this stage as follows:
Randomly select two learners
P and
Q such that
X’total-P,i ≠
X’total-Q,i, where
X’total-P,i and
X’total-Q,i are the updated values of
Xtotal-P,i and
Xtotal-Q,i at the end of the teacher phase. The interactive learning update is shown in Equations (19) and (20).
Step 12. Assessment of Learning Outcomes.
Through the designed learning outcome function, each learner will be evaluated, and each learner will have a corresponding learning outcome value. Here, the learning outcome function was evaluated based on the design conditions of the multi-point mutagenic primers.
Step 13. Terminate condition judgment.
During the iteration process, the algorithm will determine whether the current learner’s learning outcome value has been minimized, that is, a learning outcome value of 0 indicates that the optimal learning outcome value has been reached, or the calculation process reaches the preset number of iterations, and the calculation is stopped.