1. Introduction
Multiple sequence alignment (MSA) is one of the central tasks of bioinformatics. Significant efforts have been made to develop tools for MSA [
1,
2,
3], which often include dynamic programming, progressive alignment, iterative methods, hidden Markov models (HMMs), and genetic algorithms [
4,
5]. The direct use of dynamic programming is hardly possible for the alignment of a large number of sequences (more than 10), since it is an NP-complete problem [
6] and the calculations require considerable time. Therefore, some heuristic solutions have been developed, which are based on using the objective function for assessing the power of MSA and an optimization procedure. As a result, the quality of the constructed MSA is improved [
2].
The progressive alignment is the most popular MSA algorithm. It includes pairwise comparison of
N sequences, calculation of a matrix of distances between the sequences, construction of a matrix-based guide tree, and, finally, progressive MSA. Often, an optimization procedure is applied to the created progressive alignment to improve the final result and eliminate unnecessary deletions and insertions (indels). The sum of the similarity functions of pairwise alignments, which may be used as an objective function, can be obtained by projecting the constructed MSA on two-dimensional planes [
7,
8], resulting overall in
N(
N−1)/2 of pairwise alignments. Alternative target functions such as consensus [
9], entropy [
10], or circular sum [
11] are also used. However, in most cases, progressive alignment cannot create the optimal MSA since the errors obtained at any stage of the procedure are accumulated at the final alignment. To minimize such errors, an optimization procedure that utilizes various mathematical algorithms is used. The most popular programs based on the progressive alignment are Clustal [
12,
13], MAFFT [
14,
15], and T-Coffee [
16,
17].
Iterative methods are another popular approach to construct MSA as they allow reducing the errors inherent in progressive alignment. When a new sequence is added to the MSA, the total alignment is recalculated, which is in contrast to progressive alignment when each originally calculated paired alignment is final. As a result, the error rate is reduced and the objective function is optimized. The most popular programs that employ iterative procedures are Muscle [
18], PRRN [
19], and CHAOS/DIALIGN [
20].
HMMs, which are also used to construct MSA, consider the probabilities of all possible states, i.e., match, insertion, or deletion, for each character [
21,
22]. Although the Markov models, including the hidden ones, work, in fact, as a “replicating” method, they may improve the calculation speed. In the process, an initial MSA is constructed and used to determine the HMM parameters, which are then improved through iteration. Therefore, the application of HMMs to MSA is accompanied by several other optimization procedures and mathematical methods, often of an iterative nature, such as the Baum-Welch algorithm [
10], simulation of annealing [
23], gradient descent [
24], etc. After optimization, an optimal Markov model is created and all sequences are aligned with it using the Viterbi algorithm [
10] to find the best similarity between each sequence and the HMM.
The described approaches construct a statistically significant guide tree for progressive alignment based on pairwise alignments between sequences. However, in the case of highly divergent sequences, it is often not possible to find a statistically significant “germ” or common “words”. Therefore, it is currently extremely difficult or impossible to compute the MSA of highly divergent sequences.
In the case of highly divergent sequences, when each sequence pair has more than 2.5 substitutions per position (x), MSA becomes statistically significant only for a relatively large sample containing more than 10 sequences. MSA can be built using the N-dimensional dynamic programming for all the analyzed sequences, but it requires significant computational resources and is currently impossible to implement, suggesting that the available MSA approaches need improvement in this direction.
To address this problem, we have developed a new mathematical method for computing the multiple alignment of highly divergent sequences (MAHDS), which is partially based on previously described tandem repeat search algorithms [
25,
26]. Here, we upgraded such an algorithm and dynamic programming to compute the MSA by taking into account the correlation of adjacent characters. We have also developed a website for the calculation of a multiple alignment for very distantly related sequences using this method (
http://victoria.biengi.ac.ru/mahds/main). The advantage of MAHDS is that it can produce a statistically significant alignment of a set of divergent nucleotide sequences for which any pairwise alignment does not reach sufficient statistical significance.
The analysis of the performance of currently available MSA programs, including ClustalW [
27], Clustal Omega [
28], T-Coffee [
17], Kalign [
29], MAFFT [
14], and Muscle [
18] in aligning nucleotide sequences depending on the degree of their evolutionary divergence (
x) revealed that they are effective at
x < 2.5. In comparison, the MAHDS program can construct a statistically significant MSA with
x up to 3.7 for the number of sequences up to 100 and if the sequence set increases to 500, the statistically significant
x limit is increased to 4.4. MAHDS was used to align promoter sequences from the
Arabidopsis thaliana L. genome (from −499 bp upstream to +100 bp downstream of the transcription start site) and revealed that many upstream regions (from −499 to +1 bp) are highly conserved. A significant conservation was also observed in the regions from +1 to +70 bp. Furthermore, MAHDS provided identification of 25 promoter classes in the
A. thaliana genome. The possibility of using the developed mathematical method and the calculated multiple alignment to identify promoter sequences in different genomes is discussed.
2. Methods and Algorithms
2.1. General Description of the MAHDS Algorithm
The main idea behind the MAHDS method is the construction of the optimal image rather than direct calculation of the MSA. Here, we used position-weight matrix (PWM) as such an image. For example, we have a set of sequences
sq(1),
sq(2),…,
sq(
N), which we combine into one sequence
S of length
L and then determine the best alignment between
S and PWM that was originally calculated based on random sequence alignment and has the dimension of 4 ×
L/
N. Then, sequence
S is aligned against PWM using the Needelman-Wunsch algorithm [
30] (for an example, see publication [
31]) and the similarity between
S and PWM is measured based on
F(
L,
L) calculated by two-dimensional dynamic programming at the point (
L,
L). In the two-dimensional matrix
F with the size
L ×
L, sequence
S is plotted in the X axis and the PWM, multiplied
N times, in the Y axis. As a result, sequence
S will be compared by dynamic programming with sequence
S1, which contains the column numbers of the PWM matrix treated as characters and looks as 1, 2,…,
L/
N, 1, 2,…,
L/
N,…. We also created
S2, which is a randomly shuffled sequence
S.
The purpose is to determine PWM with the greatest F(L,L), which then will be referred to as the best PWM. Using the created two-dimensional alignment, we can easily reconstruct the multiple alignment of the initial sequences. Therefore, MAHDS computes MSA by calculating F(L,L), finding the best PWM, and aligning sequence S to PWM.
As it is extremely unlikely to obtain a good approximation of the optimal MSA from the first random PWM, we determined the best PWM using an optimization procedure. For this, we generated a set of random PWMs (
Q) (
Figure 1, point 1), which contained
n1 random matrices, aligned matrix number
i from set
Q with sequence
S, and calculated the corresponding
F(
L,
L) value in position
i of vector
V(
i) (
i = 1,…,
n1) (
Figure 1, point 2). Then, vector
V(
i) was sorted in the ascending order and the matrices from set
Q were arranged in accordance with the position of the corresponding value in
V(
i).
Next, the genetic algorithm was applied to introduce random mutations into the matrices from
Q and create hybrids (descendants) (
Figure 1, point 3). The matrix corresponding to value
V(1) (the smallest
F(
L,
L)) was excluded from set
Q and replaced with the descendant. Then, the values of vector
V(
i) were recalculated and
V(
i) re-sorted. The resultant matrix corresponded to the largest
F(
L,
L) contained in
V(L) and was denoted as
maxF(
L,
L) (
Figure 1, point 2). If the
maxF(
L,
L) value decreased, then the process of matrix optimization was considered complete, if not—it was repeated. As a result, we obtained
maxF(
L,
L), two-dimensional alignment of sequences
S and
S1, and the best PWM (
Figure 1). The steps of this algorithm are described in detail below.
2.2. Creation of the Set of Random PWMs
We used random sequences to obtain a set of random PWMs (
Q;
Figure 1, point 1), which consisted of
L/
N columns and 16 rows. For this, we generated a random nucleotide sequence
S2 of length
L with equal base probability (0.25), transformed it into a numeric sequence (where bases a, t, c, and g were coded as 1, 2, 3, and 4, respectively), and compared it with sequence
S1 of the same length
L containing
N repeated sequences (1, 2, …,
L/
N). Then, we filled frequency matrix
M(
L/
N, 16) as:
where
i ranges from 2 to
L.
This approach of matrix construction takes into account two effects in MSA [
32]: The frequency of nucleotides at each position and the correlation of nucleotides in neighboring positions (
i and
i − 1), which allows multiple alignment of highly diverged sequences where base frequencies in each MSA position may not differ from those in sequence
S, while maintaining the correlation of neighboring nucleotides. An example is shown in
Figure 2, where short sequences were used for convenience. However, all conclusions are applicable to sequences of any length. Let us consider the following 4-nt sequences: Atcg, tagc, cgat, gcta, atat, tata, gcgc, cgcg, atta, taat, cgta, and gcat, which were constructed with dinucleotides at, ta, cg, and gc at positions 1–2 and 3–4. As a result, in each position of the MSA shown in
Figure 2, the nucleotide frequency is 0.25, but nucleotide pairs 1–2 and 3–4 could be only at, ta, cg, or gc. The probability that the first and second columns contain only at, ta, cg, or gc is 0.0625 × 4 = 0.25 and the number of lines in the alignment is 12. Then, the expected number of each nucleotide pair is: 12 × 0.25 = 3. If a normal approximation for the binomial distribution is used, then the variance is 12 × 0.25 × (1 − 0.25) = 2.25. In total, the first two columns contain 12 pairs (at, ta, cg, or gc). Then, the normal distribution argument for this event is
= 6 and the probability due to purely random factors is less than 10
−6. Given that this value is the same for columns 3 and 4, the probability of accidentally creating the alignment shown in
Figure 2 is less than 10
−12, which means that the individual columns in
Figure 2 appear to be random, but there is a strong correlation between the adjacent bases.
Then, we calculated PWM matrix
W(
L/
N, 16) using
M(
L/
N, 16):
where
,
,
, and
. Then, matrix
W(
i,
j) was transformed to obtain the given
R2 and
Kd, calculated using the following formulas:
Here,
p1(
i) is the probability of symbols in
S1, which is
N/
L for any
i;
, where
p(
l) and
p(
m) are the probabilities of the
l or
m type nucleotides in
S (
l,m ∈ {a,t,c,g});
p(
l)
= q(
l)
/L, where
q(
l) is the number of
l type nucleotides in
S; and
L is the length of
S. For all the calculations, we used
R0 = 110
L/
N and
K0 = −1.8 [
32]. The matrix transformation procedure is described in detail in [
25] and an example for the original matrix of five columns and
R2 = 155 is shown in
Table 1 (a small matrix is used to fit in full in
Table 1). The transformed matrix has
R2 = 2000 and
Kd =−1.5.
It is necessary to transform matrix
W so that the distribution function for
F(
L,
L) is similar for different
W matrices. Therefore, the same
R2 and
Kd values should be maintained. The optimal cost for an insertion or deletion (
del, formula 5) depends on
R2 and the average value of
F(
L,
L) for a random sequence
S depends on
Kd. The distribution function
F(
L,
L) can be obtained by the shuffling of sequence
S into 1000 random sequences denoted as set
SR and calculating
F(
L,
L) of matrix
W for each sequence of the set. If
R2 and
Kd for different transformed
W matrices are the same, then
F(
L,
L) distributions will be equal or similar [
25], which is the objective of matrix transformation.
After obtaining one transformed PWM (WT) for set Q, we shuffled sequence S and repeated the calculations using Formulas (1) and (2). As a result, about 106 PWMs for set Q were selected and then filtered to leave only those that uniformly fill the Euclidean space with dimension 16 L/N. We considered each matrix as a point in this space and calculated the Euclidean distance D between all matrices. We selected the threshold D0 so that set Q contained less than 103 matrices. Any two matrices with D < D0 were excluded from the set. We denoted the number of matrices in set Q as n1.
2.3. Comparison of Set Q with Sequence S Using Dynamic Programming
Then, we aligned sequence
S with each of the matrices from set
Q using the global alignment algorithm. The
F value was calculated as:
where
n =
s(
k) + 4(
s(
j) − 1)),
i and
j each ranges from 2 to
L, and
s(
j) and
s1(
i) are elements of sequences
S and
S1, respectively. Parameter
n reflects the fact that in matrix
W, dinucleotides were taken into account. To determine
n, the previous position (
k), already included in the alignment, should be defined. It was calculated from the created transitions in matrix
F, depending on the previous base in
S and used to obtain
W(
s1(
i),
n). If the previous base of sequence
S is
s(
j −
t), then
k =
j −
t and
n =
s(
j −
t) + (
s(
j) − 1) × 4. Three cases should be considered. In the first,
t = 1 corresponds to the movement along the main diagonal of matrix
F and there is no deletion in sequence
S in the alignment (
Figure 3A). In the second,
t > 1 corresponds to a deletion of
t − 1 bases in sequence
S (illustrated in
Figure 3B for
t = 2). Finally, deletions may occur simultaneously in both sequences
S and
S1, which correspond to deletions of columns in matrix
W. If the previous symbol in sequence
S1 has the number
i − 1, then there is no deletion, but if this number is
i −
t (
t > 1), then there is a deletion of
t − 1 bases in sequence
S1. For these transitions, we did not consider the correlations of adjacent bases and took
n =
s(
j). Rather than matrix
W(
s1(
i),
n), we used matrix
W1(
s1(
i),
s(
j)):
In this case, the correlation of adjacent bases is not considered, which is quite acceptable when the number of deletions is relatively small (illustrated in
Figure 3C for
t = 2).
The zero row and column of matrix
F were filled with negative numbers,
F(0,
j) and
F(
i, 0) were 0 for
i and
j ranging from 1 to
L, respectively, and
F(0, 0),
F(1, 0), …,
F(2, 0) were also equal to 0. Matrix
E(
x,
n) was used to define the first column and row of matrix
F. The insertion/deletion penalty value (
del = 25.0) was selected based on our earlier work [
25]. The reverse transition matrix was filled along with matrix
F. Therefore, we aligned sequences
S1 and
S using the reverse transition matrix and determined
F(
L,
L). The alignment of
S1 and
S was obtained for all matrices from the
Q set. As a result, vector
V(
i) (
i = 1, …,
n1) contained
F(
L,
L) for each matrix.
2.4. Application of the Genetic Algorithm to the Q Set
To optimize matrices from the
Q set, we used a genetic algorithm described in our previous study [
25]. The aim was to change each PWM from the
Q set to maximize
F(
L,
L), which was considered an objective function.
F(
L,
L) for each matrix was put into vector
V(
i) (
i = 1, 2, …,
n1), which was sorted in the ascending order from
V(1) (the minimum) to
V(
n1) (the maximum) and the matrices in the
Q set were arranged accordingly. Then, two matrices were randomly selected with the probability of choosing a matrix, which increased with the increase of
i from 1 to
n1, and the two matrices were used to create a “descendant”, for which any element of the first matrix was selected with an equal probability. Then, rectangles were randomly selected to the right and left above and below the selected element in the first matrix with the probability of 0.25 and the elements within the rectangle were moved from the first to the second matrix to create a descendant, which replaced the PWM with
V(1).
Then, we introduced mutations in 10% of the randomly selected matrices from the
Q set. To do this, a randomly selected element of the matrix was changed to a random value in the range from −10.0 to +10.0. Usually, less than 10
4 cycles were required to achieve the moment when
V(
n1) did not increase, i.e., to reach the maximum designated as
maxV(
n1). However, in rare cases, more than 10
5 cycles were performed. At the output of the algorithm (
Figure 1), we obtained
maxV(
n1), two-dimensional alignment of sequences
S1 and
S, and matrix
maxW, which were used to compute the alignment.
2.5. Calculation of Statistical Significance for maxV(n1)
We used the Monte Carlo method to estimate the statistical significance of
maxV(
n1). Sequence
S was randomly shuffled to obtain 200 random sequences. Then, matrix
maxW was included in the
Q set described in
Section 2.2, which allowed taking into account the effectiveness of
maxW alignment with random sequences. Then, each of these sequences were treated as described in
Section 2.2,
Section 2.3,
Section 2.4 and
maxW was calculated for each, producing 200
maxV(
n1). Then, the mean
and variance
were calculated and used to compute
Z:
where
maxV(
n1) was calculated for sequences
S1 and
S in
Section 2.4.
Z was obtained for each MSA and the average Z value for random
S sequences was estimated according to Formula (7) after each sequence had been subjected to the procedures described in
Section 2.2,
Section 2.3,
Section 2.4 and here. As a result, the mean
Z = 1.8, and we can assume that the MSA is non-random at Z > 6.0.
2.6. MSA Construction
The MSA was computed using the two-dimensional alignment of sequences S1 and S. Each position in sequence S1 corresponded to a column in the MSA. Any insertion in the two-dimensional alignment (opposite to which there was a gap in sequence S1) resulted in an additional column in the MSA.
2.7. Comparison of Various MSA Methods
The algorithm shown in
Figure 1 can also be applied to determine the statistical significance of MSAs created by other algorithms. Let us denote the MSA as
A, the length of each sequence in
A as
K, and the number of sequences as
N. All sequences from
A are linked to produce sequence
S3 of length
L =
KN. Then, the PWM is calculated for
A using Formula (2), transformed using Formulas (3) and (4), and applied to create the two-dimensional alignment for sequence
S3 using Formulas (5) and (6) and to calculate
F(
L,
L). The statistical significance of
A is then computed according to Formula (7).
However, the columns that have a sum of elements < N/2 should be excluded from A to eliminate redundant deletions in the calculation of F(L, L), whereas those with the sum > N/2 cannot be excluded since it would lead to an excessive number of insertions. Consequently, the number of columns became K′ ≤ K, resulting in a new alignment A′ (K′ is the length of each sequence in A′). To construct the PWM using A′, frequency matrix M(K′, 16) was first calculated using Formula (1) and then the PWM (designated as WA’) was calculated using Formula (2). Formulas (3) and (4) were applied to transform the resulting matrix and obtain matrix WTA’, which was used to calculate F(L, L) (L = K′N) based on A’. For this, the sequence from A’ was merged with sequence S4 with all the spaces preserved. At the same time, sequence S5 containing column numbers {1, 2, …, K′} of the WTA’ matrix repeated N times was created. Then, we determined the sum of F1 = F1 + WT(s5(i),n), where n = s4(i − 1) + (s4(i) − 1) × 4 was calculated for all i from 2 to L = K’N, for which s4(i − 1) and s4(i) were not gaps, whereas for those i for which s4(i − 1) was a gap, the sum was calculated as F2 = F2 + E(s5(i),s4(i)). Matrix E was calculated from the WTA’ matrix using Formula (6). We also calculated F3 = −k1del, where k1 was the number of gaps in alignment A’, and del was the insertion/deletion penalty (Formula (5)), as well as F4 = −k2del, where k2 was the difference in the number of nucleotides between alignments A and A’. Finally, we calculated F(KN’, KN’) = F5 = F1 + F2 − F3 − F4.
Weight matrix
WTA’ is the image of alignment
A’, for which statistical significance can be estimated based on the effectiveness of the alignment between the
WTA’ matrix and random sequences. If the alignment is random, then matrix
WTA’ would be random too and
F5 would be close to the value obtained for random sequences (
Section 2.2).
Then, sequence
S4 was randomly shuffled to create 200 sequences and matrix
WTA’ was included in the
Q set as described in
Section 2.5. Each of the 200 sequences were treated as described in
Section 2.2,
Section 2.3,
Section 2.4. As a result, 200
maxV(
n1), each for a different random sequence, were obtained and used to calculate the mean
maxV(n1) and variance
. Then, we calculated
Z using Formula (7), where
F5 was used rather than
maxV(n1). The MSA constructed by different mathematical methods, including MAHDS, had the same algorithm for calculating
Z, which allowed their comparison based on
Z values (
supplementary material 1).
2.8. Algorithm for the Classification of Promoter Sequences from the A. thaliana Genome
The MAHDS algorithm developed in this study was applied to align promoter sequences from the
A. thaliana genome (downloaded from
https://epd.epfl.ch//index.php [
33]). Each promoter had length
K (600 nt), which included the region from −499 to +100 bp relative to the first base of the start codon (position +1). There were 22,694 promoter sequences in the analyzed set denoted as
PM (
supplementary material 1). Since the algorithm shown in
Figure 1 requires considerable resources to align all the promoter sequences, we created a sample containing 500 randomly chosen promoters, which were combined into one sequence
S with
L = 500 × 600 = 30,000 nt. Then, we constructed the MSA as described in
Figure 1 and
Section 2.1,
Section 2.2,
Section 2.3,
Section 2.4,
Section 2.5,
Section 2.6 and obtained
mV(
n1), two-dimensional alignment of sequences
S1 and
S, and PWM
mW(600, 16).
However, the volume of the PM set was significantly larger than the 500 randomly selected promoters included in sequence S. Furthermore, promoter sequences from the PM set might not show statistically significant alignment with maxW(600, 16). Therefore, we aligned each promoter from the PM set with matrix maxW(600, 16) using Formula (5) and considering the promoter sequence as S with L = 600. As a result, F(L, L) for each promoter from the PM set was calculated and put into the Ves(i) vector (where i is the promoter number).
Then, the promoter sequences with statistically significant Ves(i) were selected from the PM set. To do this, we used PMR(i) sets obtained by random shuffling of the promoter sequence with number i; each PMR(i) set contained 103 random sequences of 600 bp. We aligned each sequence from PMR(i) relative to the maxW(600, 16) matrix, calculated F(L, L) denoted as Vesr(j) (j = 1, 2, …, 103), and then determined the mean Ves(j) and variance D(Vesr) and calculated Z for each Ves(i) using Formula (7). If Z > Z0, then the promoter was considered to have a statistically significant alignment with the maxW(600, 16) matrix. For Z0 = 5.0, the probability of random similarity between the promoter and maxW(600, 16) was about 10−6. All promoter sequences with Z > 5.0 were assigned to the same class characterized by the maxW(600, 16) matrix.
When we created the first class of the
A.thaliana promoter sequences in this way, we removed all the sequences with Z > 5.0 from the
PM set and created
PM(1) set. The resulting set
PM(1) was used to create further classes. The described procedure was repeated for the
PM(1) set, from the creation of a new set of 500 randomly selected promoters. As a result, we created a second class of promoters and a
PM(2) set. We repeated this procedure for the sets
PM(
i),
i = 1,2, …. Each iteration created a new class and the corresponding
maxW(600, 16) matrix. If on some iteration, the volume of the PM(
i) set became less than 500 sequences, then we chose all the sequences for carrying out the multiple alignment. The multiple alignments generated for each class are shown in
Supplementary material 1. The procedure was stopped at the iteration
i =
i0 when the size of classes with
i >
i0 was less than 100 sequences. We defined the size of classes equal to 100 based on the random sequence analysis. When we performed the procedure on randomly shuffled promoter sequences (total number is 22,694), the volume of the classes ranged from 6 to 27 sequences with an average value of 16 sequences. This means that with using the threshold we kept the type I error rate less than 16%.
2.9. Divergence of Dinucleotide Positions in the Created Promoter Classes
We examined the difference in dinucleotide frequencies among the constructed MSAs based on the expected frequencies for each of the 599 positions (from −499 to +100) in the promoter regions. For this, we used the MSA obtained for sequence
S in each class (see
Section 2.8). Formula (1) was used to fill in the
Mk (600,16) matrix, where
k is the class number from 1 to 25 (see
Section 3.2). The frequencies of nucleotides
f(
j) (
j = 1, 2, 3, 4 for a, t, c, and g, respectively) and probabilities
p(
j) =
f(
j)/(
f(1) +
f(2) +
f(3) +
f(4)) were calculated for
S. Then, we calculated the expected dinucleotide frequencies as
t(
i,
j) =
Np(
i)
p(
j), where
N is the number of promoters in the constructed MSA, base
i is observed at position
l − 1, and base
j is observed at position
l in the MSA with class
k. Finally, we calculated variance
D(
i,
j)
= Np(
i)
p(
j)(1
− p(
i)
p(
j) and then
Zk(l, n) = (Mk(l, n) − t(i, j))/D(i, j) for each element of the
Mk(l,n) matrix, where
n ranged from 1 to 16 (
n =
i + 4(
j − 1)) and
l—from 2 to 600. Therefore, we could access the difference between the frequency of a base pair (
i, j) at position
l in the MSA and the expected frequency.
Zk(l, n) is the normal approximation for the binomial distribution. The larger
Zk(l, n), the greater the difference between the observed and expected frequencies.
To estimate the conservation of position l, we used sum , which follows the χ2 distribution with 15 degrees of freedom. We transformed χk(l) into an argument of normal distribution using the normal approximation for chi-squared distribution , where the number of degrees of freedom (f) is 15. As a result, function χk(l) was obtained for multiple alignment k, where l ranges from 2 to 600. The greater χk(l), the more conserved is position l in the MSA.
4. Discussion
In this work, we developed the MAHDS algorithm for MSA and created a server for its application available at
http://victoria.biengi.ac.ru/mahds/main. MAHDS can perform the alignment for nucleotide sequences with a high degree of diversity, which cannot be done with the other method, including ClustalW [
27], Clustal-Omega [
28], T-Coffee [
17], Kalign [
29], and MAFFT, which align sequences that accumulated no more than 2.5 substitutions per nucleotide (
x < 2.5). In this study, we used default parameters, however, we also tried other settings. Therefore, we examined the statistical significance of aligning model sequences by ClustalW using eight different values of gap open (Go) penalties in the interval from 2 to 30 and gap extension (Ge) penalties from 0.1 to 25% for each Go value. For
Z = 5.0, the maximum
x was 2.4 and in default conditions,
x was 2.1 (
Figure 4), which means that variations in Go/Ge penalties can increase
x by ~15%, which is within the error shown in
Figure 4. To estimate the error in calculating
Z, we used different
G(
x) sets (
Section 3.1) generated from different initial sequences. It is quite unlikely that a heuristic algorithm ClustalW that progressively builds MSA from a series of pairwise alignments could produce results with
Z > 5.0 for
x > 2.5. Replacing the PAM matrix with BLOSUM did not significantly affect the results shown in
Figure 4 and all changes were within the error shown there. For the other methods used in
Section 3.1, the increase in
x also did not exceed ~15%.
In contrast to the other algorithms, MAHDS calculates statistically significant MSA for x < 3.7; furthermore, if the number of sequences to be aligned increases to 500, then the limit for x increases to < 4.4. We believe that such a capability of MAHDS is very important for MSA of different genes and regulatory sequences and could provide higher precision in their annotation. To align 100 sequences of 600 nt each by the MAHDS method, we used a computer cluster with 64 computing cores (eight Ryzen 7 1700 processors) and the alignment took less than 6 min.
It is usually a challenge to perform the alignment of promoter sequences due to their considerable dissimilarity [
45]. Despite the large number of promoters, it has not been possible to obtain a statistically significant multiple alignment of their sequences [
46]. By using MAHDS, we could construct a statistically significant multiple alignment for ~ 78% of the known promoter sequences from the
A. thaliana genome. The estimated average number of substitutions per nucleotide (
x) in the created promoter classes was 3.7, whereas the other methods provided statistically significant MSAs at
x < 2.5, which explains why such MSA of promoters has not been obtained previously. Therefore, MAHDS could perform multiple alignment of sequences with a low degree of similarity, i.e., those that have accumulated a considerable number (3.7) of substitutions per nucleotide, which enabled us to classify promoters and reveal some common properties within each class at a statistically significant level. However, the developed MAHDS algorithm, similar to the existing methods for computing MSA (MAFFT, T-Coffee, Kalign, Muscle, ClustalW, and Clustal-Omega), is purely mathematical and, therefore, can only reveal structural similarities among the aligned sequences but not their biological significance. At this point, it is difficult to conclude whether the revealed patterns are inherent to promoters or are also present in other sequences. Similarly, it is unclear whether the observed sequence similarity is associated with a common evolutionary origin of promoter regions or with the general functional role of promoters in gene transcription. These problems should be addressed in special studies, where the promoter classes and MSAs obtained here would be used to identify promoter sequences in various genomes and correlate them with experimental results, which can be done using HMMs [
47,
48].
The search for potential promoters in various genomes is a major challenge. The currently available predictive algorithms such as TSSW [
45], PePPER [
49], and G4PromFinder [
50] use the existing mathematical methods, which do not provide a statistically significant alignment of diverse sequences. The best algorithms can predict one false positive point in 10
3–10
4 DNA bases. As a result, it is not possible to distinguish the true promoter from false hits. For the promoter prediction, it will be convenient to use
maxW(600, 16) matrices calculated here and the mathematical algorithm described in
Section 2.3 as various successively cut genome fragments can be considered as sequence
S. We believe that a number of false positive hits will be several hundred or even a few dozen per genome size of 3 × 10
9 bases.
In this work, we used MAHDS to construct the multiple alignment of promoter sequences. The constructed alignment can be used for the subsequent prediction of promoters in genome sequences and to improve the accuracy of these predictions. In this case, to reduce the number of false positives it is important to get MSA with the highest statistical significance (as we wrote above). However, if MAHDS is used for other purposes, it can be important to estimate the biological correctness of the constructed MSA. Methods developed for the evaluation of MSA accuracy and correctness utilize structural information about corresponding proteins [
51]. However, here we consider DNA sequences for which, according to our estimate,
x ~ 3.7. To our knowledge, there are no biologically correct MSA constructed for the set of all promoter sequences. As well as there are no such MSA for any DNA sequences with
x ~ 3.7. The point is that both pairwise and multiple alignments could be constructed only for sequences with
x < 2.5, as shown in
Figure 4,
Figure 5 and
Figure 6. Therefore, in this work, we focused on the statistical significance of the constructed alignments, which is a common statistical technique. We believe that a biologically correct alignment using MAHDS for
x > 2.5 can be constructed in the future if one can find amino acids or DNA sequences with a similar structure but with sequences’
x ~ 3.5–3.7. Then, it will be possible to improve the accuracy of the MAHDS algorithm in the same way as it was done for other methods [
51]. However, even the present form of MAHDS can be used to predict promoter sequences, as we wrote in the paragraph above.
In our future research, we will focus on the development of the software for protein MSA based on the same method. For protein sequences, the number of possible amino acid pairs is 400 and a large number of sequences (over 2000) would be required to fill in the maxW(L/N, 400) matrix, where L is the total sequence length, N is the number of sequences, and L/N is the average length of sequences. Therefore, it will be problematic to use this approach for small samples (less than 100 sequences), when the statistically significant filling of the maxW(L/N, 400) matrix is not possible. We intend to apply a series of optimization procedures to address this issue.