1. Introduction
Influenza is a viral respiratory infection that is usually seasonal in nature. Influenza viruses are divided into types A, B, C and D, each belonging to a genus of the family Orthomyxoviridae [1,2]. The hosts and transmissibility of the different types of influenza viruses vary, with type A being highly infectious to both humans and animals [2]. In March 2009, a type A influenza virus emerged in Mexico and the United States and was later designated H1N1; the name stands for the viral type 1 haemagglutinin and type 1 neuraminidase. According to the official report of the World Health Organization [3], the virus had spread to over 200 countries and regions, causing 17,843 deaths by 28 March 2010, although the actual death toll is estimated to far exceed this number. The 2009 H1N1 virus is believed to have originated as a reassortant virus that had been circulating in pigs, and swine-to-human transmission likely began several months prior to the outbreak [4].
The influenza viral genome consists of segmented negative-strand RNA that relies on a virus-encoded RNA polymerase for replication [1]. A typical type A influenza virion contains eight RNA segments: three encoding the viral RNA polymerase, one the haemagglutinin, one the neuraminidase, one the nucleoprotein that binds the viral genome, one the matrix and membrane proteins, and one the nuclear export protein and the non-structural protein 1 (NS1) [5]. NS1 is the key virulence protein of influenza viruses such as H1N1. It acts by inhibiting post-transcriptional processing of cellular pre-mRNA and shutting off cellular protein synthesis without affecting viral protein synthesis [6]. Mutations in the NS1 gene are associated with H1N1 virulence [7], so the gene has the potential to serve as an indicator of H1N1 evolution.
H1N1 is also believed to be the pathogen of the 1918–1919 pandemic, which killed an estimated 50–100 million people worldwide [8]. Although the H1N1 strain prevalent in the 2009 pandemic turned out to be less lethal than that of 1918–1919, concerns emerged that subsequent mutations could raise its lethality, given its widespread transmission and the mutation-prone nature of RNA viruses [9]. Especially after the profound lesson of the SARS-CoV-2 pandemic, the importance of timely tracking of mutations in pandemic viruses is widely recognized. In order to properly define the different strains of a virus and determine whether newly identified viruses evolved from a baseline strain, it is helpful to develop "minimal models" of viral genomes. We conceptually define a minimal model of a cluster of virus genomes as a representative benchmark of all viral genome sequences in the cluster. One may, for example, define the minimal model of a genome cluster as the longest common subsequence (LCS) of all genome sequences in the cluster, and other definitions correspond to different clustering algorithms.
To date, only one previous study has analyzed the genome from the perspective of the minimal model; it treats the bacterial genome as the result of evolving a minimal model through random mutation and replication [10]. To the best of our knowledge, no known work uses the minimal model of the genome as a baseline for genome comparison and clustering. Current methods for comparison and clustering of genomes are usually divided into alignment-based methods, represented by multiple sequence alignment (MSA) [11], and alignment-free methods, represented by the natural vector (NV) [12]. In this work, we adopted three distinct algorithms, two alignment-based and one alignment-free, to develop minimal models of H1N1 viruses in different districts based on their NS1 gene sequences. The first method finds the LCS of the NS1 gene sequences of all viruses in the district to be studied. As is commonly known, there exists a dynamic programming algorithm with O(mn) time complexity to compute the LCS of two sequences of lengths m and n, but computing the LCS of multiple sequences is a non-deterministic polynomial-time hard (NP-hard) problem [13]. We therefore conducted some pre-processing on the data and adopted some heuristics. The second method is based on a mature MSA algorithm named Clustal Omega [14]. The third method is alignment-free: we applied NV encoding to the sequences. The NV is a Euclidean space vector representation of gene sequences that maps gene sequences of different lengths to vectors of the same dimension. NVs are widely used in phylogenetic analysis of genetic sequences [12], sequence comparison [15], tracing of emerging pathogens [16] and mutation distribution analysis [17], and they also provide a geometric perspective for genome sequence description [18]. The NV has the benefit of preserving distances to a certain extent when mapping gene sequences into Euclidean space, allowing for high-speed sequence comparison.
We evaluated these algorithms in a five-fold cross validation classification scheme. First, we regarded the viruses from each district as a cluster; only districts with at least five sequences were retained. Then, 80% of the sequences in every cluster were randomly chosen as the training set to compute the minimal model of each class, while the remaining 20% of sequences were used as the test set. The process was repeated five times, each time with a different split into training and test sets. Finally, the classification performance on the test set was computed using the weighted average F1-score. Through this work, the feasibility of genome classification by minimal models was verified. Under the five-fold cross validation scheme, we compared the accuracy of the different algorithms, as well as their time complexity in generating the minimal model and predicting the class of a genome. Our results provide a reference for the choice of algorithms when constructing minimal models of genomes and applying minimal models to genome clustering.
2. Materials and Methods
2.1. Dataset
A total of 18,211 H1N1 NS1 sequences belonging to 845 districts were retrieved from the National Center for Biotechnology Information [NCBI, https://www.ncbi.nlm.nih.gov (accessed on 12 April 2022)] with the keywords "H1N1" and "NS1", and 3169 sequences from 148 districts were kept, such that the number of sequences in each retained district was between 5 and 50. We assumed that viruses in the same district share similar NS1 sequence features, so the minimal model of the viral NS1 sequences in each district could serve as a representative and predictive baseline for tracing viruses with new sequences.
2.2. Longest Common Subsequence-Based Minimal Model and Distance
To find a minimal model of a group of sequences, the starting point was to find their LCS. Since the multiple-sequence LCS problem is NP-hard, no polynomial-time algorithm to find it is currently known. In addition, the LCS of a group of sequences is sensitive to outliers: if the features of one sequence are very different from those of the others, the resulting LCS is greatly affected by that sequence. To overcome these problems, the heuristic approach described below was adopted.
For sequences a and b, we measured their dissimilarity by D(a, b) = len(a) + len(b) − 2·len(LCS(a, b)), where len(·) denotes the length of a sequence and LCS(·, ·) denotes the LCS of two sequences, which can be computed in polynomial time. For a group of sequences of size M, we randomly selected 5 sequences from the group and computed the corresponding 5 × M dissimilarity matrix D, where D_ij denotes the dissimilarity between the i-th randomly selected sequence and the j-th sequence in the group. All sequences whose index j satisfied #{i : D_ij > 100} ≥ 4 were excluded, as we considered these sequences outliers in the group because at least 4 out of the 5 randomly selected sequences had a dissimilarity greater than 100 with them. For the remaining sequences, we randomly selected one as the starting point and compared it with the rest one by one in a random order; at each step, only the LCS of the two compared sequences was kept and carried into the next comparison. We repeated the one-by-one comparison process five times and took the longest resulting common sequence as the minimal model of the group. This repetition aimed to reduce the gap between the LCS generated by a particular one-by-one comparison order and the global LCS.
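This heuristic can be sketched in Python. The dissimilarity measure, the 5 random probe sequences, the cutoff of 100 and the 5 repetitions follow the description above; all function names are illustrative, and the pairwise LCS uses the standard dynamic programming recovery.

```python
import random

def lcs(a, b):
    """One longest common subsequence of two strings via O(len(a)*len(b)) DP."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the DP table to recover one LCS string.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

def dissimilarity(a, b):
    # D(a, b) = len(a) + len(b) - 2*len(LCS(a, b))
    return len(a) + len(b) - 2 * len(lcs(a, b))

def minimal_model_lcs(seqs, probes=5, cutoff=100, repeats=5):
    """Outlier filtering followed by repeated one-by-one LCS reduction."""
    probe_seqs = random.sample(seqs, min(probes, len(seqs)))
    # Drop sequences that at least 4 probes consider far away.
    kept = [s for s in seqs
            if sum(dissimilarity(p, s) > cutoff for p in probe_seqs) < 4]
    best = ""
    for _ in range(repeats):
        order = kept[:]
        random.shuffle(order)
        model = order[0]
        for s in order[1:]:
            model = lcs(model, s)  # keep only the pairwise LCS at each step
        if len(model) > len(best):
            best = model
    return best
```

For the toy group ["AGCT", "AGT", "ACT"], every comparison order reduces to the same minimal model "AT".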
To calculate the distance between two minimal models, the Levenshtein distance L, which is defined as the minimum number of edit operations required to convert one sequence into another, was used as the distance function. Permitted operations are inserting a character, deleting a character and substituting one character for another. The Levenshtein distance of sequence a and sequence b, denoted as L(a, b), can be computed by dynamic programming according to the following recursive formula:

L(a[0:i], b[0:j]) = max(i, j), if min(i, j) = 0;
L(a[0:i], b[0:j]) = min{ L(a[0:i−1], b[0:j]) + 1, L(a[0:i], b[0:j−1]) + 1, L(a[0:i−1], b[0:j−1]) + 1(a[i−1] ≠ b[j−1]) }, otherwise.

Here, 1(a[i−1] ≠ b[j−1]) equals 1 if a[i−1] ≠ b[j−1], otherwise 0; a[i:j] denotes the sub-sequence between index i and index j of a (including i but not j; the index is 0-based). For example, if a = "AGCT", then a[1:3] = "GC". The end of the recursion, L(a[0:0], b[0:j]), is equal to j.
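The recursion translates directly into a bottom-up dynamic programming routine; this is the standard textbook implementation, shown here only to make the recursion concrete.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = L(a[0:i], b[0:j]); row 0 and column 0 encode the base case max(i, j).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # the indicator 1(a[i-1] != b[j-1])
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a[i-1]
                           dp[i][j - 1] + 1,         # insert b[j-1]
                           dp[i - 1][j - 1] + cost)  # substitute (or match)
    return dp[m][n]
```

For instance, levenshtein("AGCT", "AGT") is 1, corresponding to a single deletion.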
2.3. Multiple Sequence Alignment-Based Minimal Model and Distance
Multiple sequence alignment is usually used in sequence comparison. One purpose of MSA is to align multiple biological sequences by inserting gaps ("-") at appropriate locations in the sequences so as to generate the minimum number of non-homogeneous sites. An example is given in Figure 1.
There is no precise definition of "minimum number of non-homogeneous sites", and all MSA algorithms are heuristic. Clustal Omega is one of the latest algorithms; it works by first performing pairwise alignment using a k-tuple approach, then clustering the sequences with the mBed method and the k-means method, followed by constructing guide trees using the UPGMA method and finally performing a progressive alignment using the HHalign package [19].
In our experiments, we used the Clustal Omega 1.2.4 software package with default parameters to perform MSA, and the minimal model was defined on the basis of the consensus sequence. For each position in the aligned sequences, if a base (A/G/C/T) appeared in at least a threshold fraction of the sequences, we appended the base to the corresponding position of the minimal model; otherwise, we dropped the position. For instance, in the example given in Figure 1, the minimal model was "AGCTGCCA", and the positions at which no base reached the threshold, including the 2nd, were dropped. As in the previous subsection, we used the Levenshtein distance L to compute the distance between a newly given sequence and the existing minimal models and predict the district of the new gene sequence.
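This consensus extraction can be sketched as follows. The input is a list of already-aligned, equal-length rows (e.g., Clustal Omega output); the threshold fraction is left as a parameter with an assumed default of a simple majority, since its exact value is a choice of the experimenter.

```python
from collections import Counter

def consensus_minimal_model(aligned, threshold=0.5):
    """Column-wise consensus of aligned sequences.

    Columns whose most frequent base (A/C/G/T, gaps excluded) appears in
    fewer than threshold * n_sequences rows are dropped entirely.
    """
    n = len(aligned)
    model = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(c for c in column if c in "ACGT")
        if counts:
            base, freq = counts.most_common(1)[0]
            if freq >= threshold * n:
                model.append(base)
    return "".join(model)
```

For example, consensus_minimal_model(["AG-T", "AGCT", "AGTT"]) keeps columns 1, 2 and 4 but drops column 3 (where "C" and "T" each appear only once), giving "AGT".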
2.4. Natural Vector-Based Minimal Model and Distance
The k-mer natural vector is an encoding approach that maps gene sequences into a Euclidean space [12]; a k-mer is one of the 4^k sub-sequences of length k with each position chosen from {A, T, G, C}. A gene sequence can be mapped to a 3·4^k-dimensional vector. Given a sequence S, the first 4^k entries of the vector are (n_1, n_2, …, n_{4^k}), where n_i denotes the number of times k-mer i occurs in the sequence. The second 4^k entries are (μ_1, μ_2, …, μ_{4^k}), where μ_i is defined as the arithmetic mean of the distances from all occurrences of k-mer i to the first base in the sequence. If n_i = 0, μ_i is defined to be zero. The last 4^k entries are (D_2^1, D_2^2, …, D_2^{4^k}), where D_2^i is defined as the second-order normalized central moment of the distances from all occurrences of k-mer i to the first base. Particularly,

D_2^i = Σ_{j=1}^{n_i} (d_{ij} − μ_i)² / (n_i · l).

Here, l is the length of the gene sequence, and d_{ij} is the distance between the j-th occurrence of k-mer i and the first base. Finally, the 3·4^k-dimensional vector (n_1, …, n_{4^k}, μ_1, …, μ_{4^k}, D_2^1, …, D_2^{4^k}), called the natural vector (NV), is normalized to have Euclidean norm 1 by dividing each element by the Euclidean norm of the original vector; we use e to denote the whole encoding process, e: Ω → R^{3·4^k}. One can consider higher-order moments to obtain higher-dimensional NV encodings, but extending the NV with higher-order moments has proved of little use [12], so here we considered moments up to the second order.
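The encoding can be sketched in Python. Here the distance d_ij of a k-mer occurrence to the first base is taken as its 0-based starting position, which is one common convention; this offset is an assumption, since the text above does not fix it.

```python
from itertools import product
from math import sqrt

def natural_vector(seq, k=1):
    """Map a sequence to the normalized 3*4^k-dimensional k-mer natural vector."""
    kmers = ["".join(p) for p in product("ATGC", repeat=k)]
    l = len(seq)
    # Collect the positions (distances to the first base) of each k-mer.
    positions = {km: [] for km in kmers}
    for i in range(l - k + 1):
        window = seq[i:i + k]
        if window in positions:
            positions[window].append(i)  # 0-based position as distance (assumed convention)
    counts, means, moments = [], [], []
    for km in kmers:
        pos = positions[km]
        n = len(pos)
        mu = sum(pos) / n if n else 0.0
        # Second-order normalized central moment: sum((d_ij - mu)^2) / (n * l).
        d2 = sum((p - mu) ** 2 for p in pos) / (n * l) if n else 0.0
        counts.append(float(n)); means.append(mu); moments.append(d2)
    vec = counts + means + moments
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]  # normalize to Euclidean norm 1
```

For "ATGCATGC" with k = 1, every base occurs twice, so the first four (count) entries of the normalized vector are equal.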
For two NVs u and v, their distance d can be calculated from the cosine similarity:

d(u, v) = 1 − ⟨u, v⟩ / (‖u‖ · ‖v‖),

where ‖·‖ is the Euclidean norm and ⟨·, ·⟩ is the natural inner product. The biological distance between two sequences can thus be represented by the mathematical cosine similarity of the two corresponding NVs.
The minimal model of a group of sequences based on NVs is a "central point" that has a minimal sum of distances (in the sense of cosine similarity) to the NVs of the sequences in the group. This problem has no analytical solution, so we applied an iterative algorithm [20] described as follows. First, the weight of each point is initialized to 1. Then, iteratively, the weighted mean of all points is computed as the central point and normalized to Euclidean norm 1, and the weight of each point is updated to the exponential of its negative distance to the central point. When the distance between the central points computed in adjacent iterations falls below a given convergence threshold, or the number of iterations reaches a predefined maximum, the iteration is terminated. In our experiment, the maximum number of iterations was set to 10,000 and the convergence threshold to a small fixed value. This "central point" was considered as the minimal model of the group of sequences. It does not necessarily correspond to a real gene sequence but can still serve as a baseline for comparison with other sequences according to the cosine similarity between NVs.
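The iteration can be sketched as follows, assuming all input points are already unit-norm NVs; the convergence threshold of 1e-6 is an illustrative choice, not the value used in the experiments.

```python
from math import exp, sqrt

def cosine_distance(u, v):
    # For unit vectors, d(u, v) = 1 - <u, v>.
    return 1.0 - sum(x * y for x, y in zip(u, v))

def central_point(points, max_iter=10000, tol=1e-6):
    """Weighted-mean iteration for the NV 'central point' (minimal model)."""
    weights = [1.0] * len(points)
    center = None
    for _ in range(max_iter):
        total = sum(weights)
        new_center = [sum(w * p[i] for w, p in zip(weights, points)) / total
                      for i in range(len(points[0]))]
        norm = sqrt(sum(x * x for x in new_center))
        new_center = [x / norm for x in new_center]  # project back to the unit sphere
        # Stop when the center barely moved between adjacent iterations.
        if center is not None and cosine_distance(center, new_center) < tol:
            center = new_center
            break
        center = new_center
        # Re-weight: points far from the center get exponentially smaller weights.
        weights = [exp(-cosine_distance(p, center)) for p in points]
    return center
```

By symmetry, the central point of the two unit vectors (1, 0) and (0, 1) converges to (1/√2, 1/√2).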
2.5. Five-Fold Cross Validation Classification Scheme
All algorithms (LCS-based, MSA-based and NV-based) were evaluated and compared under the five-fold cross validation classification scheme. We repeated the following steps five times, each time with a different test set. First, we randomly chose 80% of the sequences in each district as the training set and used them to generate a minimal model for that district. Then, we used the remaining 20% of the sequences as the test set. For each sequence in the test set, we computed the distance between each minimal model and the sequence, based on a properly defined distance function, and predicted that the sequence belonged to the district whose minimal model had the minimum distance to the sequence.
Formally, C was used to denote the collection of all 148 districts: C = {C_1, C_2, …, C_148}. Each district C_i contained m_i H1N1 NS1 gene sequences: C_i = {S_1^(i), S_2^(i), …, S_{m_i}^(i)}, where every S_j^(i) belongs to the sequence space Ω, which contains all finite strings consisting of the characters "A", "T", "G" and "C". In each fold, a district C_i was divided into a training set T_i and a test set V_i, where T_i ∪ V_i = C_i, T_i ∩ V_i = ∅ and |T_i| ≈ 0.8|C_i|. The collection of subsets of Ω was denoted as 2^Ω. Each algorithm applies an encoding function to sequences, which maps a sequence into an encoding space E:

e: Ω → E.

For the LCS-based and MSA-based algorithms, the encoding space E is Ω itself, and the encoding function e is just the identity map. For the k-mer NV-based algorithm, the encoding space E is the Euclidean space R^{3·4^k}.
The minimal model generating function is an algorithm whose input is a set of sequences and whose output is an element of the encoding space:

g: 2^Ω → E.

The output represents the set of sequences to some extent.
The distance function maps a pair of encodings to a real number:

d: E × E → R.

The classification function c is defined based on the distance function:

c(S) = argmin_i d(e(S), g(T_i)).
If there are multiple districts whose minimal models have the same minimal distance to sequence S, c(S) takes the minimal index among them. The classification performance is measured by the weighted average F1-score:

F1_weighted = Σ_i (|V_i| / Σ_j |V_j|) · 2·TP_i / (2·TP_i + FP_i + FN_i),

where TP_i, FP_i and FN_i are the true positive number, false positive number and false negative number of the i-th class, respectively. Under different criteria, they can be computed differently. Criterion t says that a sequence is predicted correctly if the minimal model of its true district is among the t closest of all districts' minimal models with respect to the distance function d. TP_i is the number of correctly predicted sequences in V_i. FP_i is the number of sequences S outside V_i satisfying c(S) = i. FN_i is the number of incorrectly predicted sequences in V_i. After randomly selecting different training sets T_i and test sets V_i each time and repeating all steps 5 times, we computed the arithmetic mean of the 5 weighted average F1-scores as the evaluation of the classification algorithm.
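Putting the pieces together, the per-fold evaluation can be sketched as a nearest-minimal-model classifier followed by a weighted average F1 computation. This scheme-level sketch assumes the minimal models and the distance function are given; the tiny label example in the usage note is hypothetical.

```python
def classify(encoded_seq, models, dist):
    """c(S): index of the district whose minimal model is closest; ties -> smallest index."""
    distances = [dist(encoded_seq, m) for m in models]
    return distances.index(min(distances))  # index() returns the first (smallest) minimizer

def weighted_f1(true_labels, predicted_labels, n_classes):
    """Weighted average F1-score; each class is weighted by its support (TP + FN)."""
    total = len(true_labels)
    score = 0.0
    for i in range(n_classes):
        tp = sum(t == i and p == i for t, p in zip(true_labels, predicted_labels))
        fp = sum(t != i and p == i for t, p in zip(true_labels, predicted_labels))
        fn = sum(t == i and p != i for t, p in zip(true_labels, predicted_labels))
        support = tp + fn
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        score += (support / total) * f1
    return score
```

For true labels [0, 0, 1] and predictions [0, 1, 1], both classes have F1 = 2/3, so the weighted score is 2/3.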
4. Conclusions and Discussion
In this study, we proposed a formal definition and implementation of genome minimal model construction and verified its feasibility for genome classification. We applied three algorithms to generate minimal models from the H1N1 NS1 dataset. In order to quantitatively evaluate the minimal models generated by the different algorithms, a five-fold cross validation classification scheme was designed. Under this scheme, the LCS-based, MSA-based and NV-based algorithms were tested on generating minimal models and predicting the district of newly given sequences. The MSA-based algorithm achieved the highest accuracy, while the NV-based algorithm generated the minimal model in the least time while maintaining predictive power. The k-mer NV encoding method involves a parameter k, and we reached the same conclusion as Wen, J. et al. [12] about the best choice of k in k-mer NV encoding.
Although the NV-based method is slightly less accurate than the MSA-based method in the task of virus district prediction, the former has several advantages worth discussing, some of which are also reflected in other applications of NV encoding. Firstly, according to our comparison of the time complexity of generating minimal models and to previous studies [22], the NV-based algorithm is more efficient than alignment-based algorithms. Secondly, the NV-based method can handle genome sequence collections with arbitrary diversity, while MSA-based algorithms can produce potentially unreliable results when variability among genomes is high [23]. Another advantage of the NV-based method is that it can quickly update the minimal model when new sequences are added: once a new sequence is detected in a district, one only needs to assign it an average weight and add it to the iterative process described in Section 2.4, which quickly converges to a new NV minimal model representation. By comparison, the MSA-based method requires re-running the time-consuming MSA algorithm over the entire set of sequences whenever a sequence joins or changes [22]. In addition, since a vector in the continuous Euclidean space does not necessarily correspond to a real sequence in the discrete sequence space, the NV-based method produces a minimal model representation with finer granularity than the MSA-based and LCS-based methods.
This work regards the minimal model as a baseline for classification, comparison and virus tracing. Compared with general supervised classification methods, such as the support vector machine (SVM), this approach provides better interpretability and updatability, because classification based on the minimal model provides information about the distance between a sequence and the minimal model of each class, and adding samples only requires updating the minimal models of the classes involved. Moreover, one can generate different minimal models at different classification levels. We only used the H1N1 epidemic district as the classification criterion, but other criteria (for example, viral subspecies, epidemic period or symptoms caused) could also be used to generate minimal models of virus genomes, followed by evaluation with the cross validation framework under the hierarchy corresponding to the chosen criterion. If there are enough samples for each class, the minimal model can be used to quickly classify viruses with newly discovered genomes and to predict any attribute of interest, as long as that attribute is a classification criterion for minimal model generation. Although the NV-based method has achieved higher accuracy than the MSA-based method in other tasks, such as phylogenetic analysis [12], it is slightly less accurate than the MSA-based method for classifying genomes based on minimal models, implying that the NV-based method still has room for improvement.