1. Introduction
Metagenomics studies and studies of polyploid organisms are two areas of research in genomic analysis that are forced to deal with datasets containing high variation among the individual genetic sequences, which negatively affects the data analysis.
Metagenomics studies communities of micro-organisms, like the ones forming the human microbiome [
1] or the microbiome of urban environments [
2], and are the focus of studies of viral evolution [
3]. Understanding of the human microbiome is essential for the future of medicine [
4], is a significant part of nutrition studies [
5], affects human space flight [
6], and the proper study of disease outbreaks [
7]; metagenomics studies are also critical for agricultural development [
8].
From a computational perspective, however, the presence of genetic sequences from multiple similar organisms in the same sample—a feature of great importance from a scientific standpoint—is a big obstacle to the data analysis. The inevitable presence of machine errors in the data, which can be masquerading as legitimate inter-species variation, cannot always be accounted for. To make matters worse, not even the number of organisms in the samples is always known, and it is not uncommon for these samples to be comprised of altogether unknown organisms, as the majority of micro-organisms in the world are not known. The errors cause inaccuracies in the final study results [
9,
10] and are often resolved by discarding reads suspected to contain errors [
11] in their entirety.
Polyploid organisms, most notably the genome of hexaploid wheat [
12], present a similar challenge, albeit a less pronounced one. This is due to the fact that it contains three distinct, yet very similar, subgenomes with three pairs of chromosomes that wheat inherited from three different ancestor organisms [
13]. This is less than the number of genomes present in a metagenomics study, but significantly more than the number of genomes present in the study of any typical diploid organism. As a staple food grown more than any other crop in the world [
14], the importance of understanding its genome is tremendous, yet the analysis has been slowed down due to its hexaploid nature.
There is a lack of good computational tools to filter errors in this type of dataset, as most such tools target individual monoploid or diploid organisms. Artificial intelligence and in particular machine learning (ML) offer a suitable tool to approach this problem, especially in light of the obscure characteristics of the datasets. Models based on ML, such as artificial neural networks (ANN) [
15] and random forests (RF) [
16], can be trained to classify examples based on unknown features of the dataset, expressed either in numerical or non-numerical form. This is possible even when the systematic relationship between the input and the result is not yet known, but there are enough training examples on which the models can be trained to recognize this relationship. Random forests are becoming more and more popular in bioinformatics [
17], where there is a rapid influx of poorly-studied raw data.
This paper extends the earlier published [
18], focusing on creating an ML-based approach, comprised of several identically-trained ML-based models, to classify bases into groups of sequencing errors and correct bases. In addition, it also discusses an earlier approach to discover errors numerically—or, in the case of this paper, error candidates—based on a more complex estimate of the frequency of appearance of individual nucleotide bases across the different genetic sequences and offers this approach as a tool to select candidates for testing with the ML-based model, while also evaluating the ability to use the ML-based models on their own. The details of the ML-based model are extended to include most of the tests performed for the choice of the model, as well as the details on the input construction.
3. Using a Machine Learning Model
The presence of the complex unknown structure inside the genetic chains of the studies species suggested the use of machine learning-based models. Since ML-based models like artificial neural networks are a very powerful tool to detect unknown and unstudied relations between the input variables; they make a perfect candidate to attempt classification of the data, especially after the analytical ways to approach the problem did not seem to yield sufficiently good results.
To select the most appropriate ML-based model, during the study, the same input was provided to different models: an artificial neural network, a random forest [
16] model, as well as various other decision tree and classification models.
3.1. Machine Learning Input
The learning input was composed based on two presumptions:
The frequency of a base in a given column is the most significant factor in whether it might be erroneous, as errors would be among the rarest bases in their position/column.
The frequency of the base among locally-similar sequences is more important than the frequency among locally-dissimilar sequences, making the factor of similarity a good criterion to split the sequences into multiple bins to produce the input values.
For each nucleotide base in each sequence, represented by a single character in the sequence’s string, a learning input example will be constructed. For each such base, the nucleotide sequences from the dataset will be placed in multiple bins depending on how similar they are to the evaluated sequence. A triplet of such bins will be created for a pair of equidistant bases neighboring the evaluated one. These auxiliary pairs of bases will serve as the criteria of local similarity between the sequence with the evaluated base and the remaining sequences. For each such pair, the set of sequences can be split into three groups: the sequences that concur with both bases in both positions, the sequences that differ from the bases in both positions, and the sequences that concur with only one of them.
Around each evaluated character in position i, a window of radius w will be constructed inside the evaluated sequence r, containing the neighboring characters of . Then, for each offset inside the window, , the pair of characters will be selected, and three bins will be constructed: the subset of character sequences p for which , the subset of character sequences p for which , and all the remaining sequences, which would differ in either position or , but not both.
The frequency of base at position i will be recorded for each of these three subsets. This frequency is the count of characters for which , among the reads p in the bin, where p is not r, divided by the total number of the sequences in the bin. Thus, we would define . Here, S is the set of sequences found in the bin.
The computed three frequencies will be added to the input for the ML-based models, which will be comprised of a series of such triplets, creating an input vector of length .
Figure 1 shows an example of a learning input to evaluate the middle base T, using a window
w of three bases. For the closest pair of neighboring bases T and G, the entire dataset has been split into three. Among the sequences that contain T and G in both of those positions,
confirms the middle base T; among the sequences that contain only one of them in their position,
confirms the middle base T; and among the sequences that contain neither, only
confirms it. These three values are used as the first three values inside the input vector that will be classified by the ML-based models. This is repeated for the next pair of characters A and T and then for the most distant one: C and A.
Let us assume that there are n sequences of average length m, and we need to compute the input vectors for approximately all m characters inside all n sequences. Without complex, potentially inexact, optimizations, this will have a time complexity of and a space complexity of . The time complexity is reasonable, since would be the time to loop over the learning inputs once.
Because speed was not the main aim of this paper, during the experiments, a slower algorithm with a time complexity of and a space complexity of was used, as it was more straightforward to verify that it constructs the examples exactly as defined here.
3.3. Imputation
Another problem was faced during the input set construction. There was not always data to construct all values inside the input vectors. There were cases where the bins would be empty, because no other sequence but the evaluated one would contain a given base that was used to construct the bin. In such cases where there would be no other sequences in the bin, the computation of would result in a division by zero.
One way to solve this problem would be to provide the nominator and denominator of independently to the ML-based model, but in this work, the choice was made to instead use imputation and put the average value for other bases inside all the bins S having the same offset and number of matches at the offset. Because the computation of that average for all the sequences is resource-intensive, it was computed only for a random subset of a hundred sequences.
The rationale of the imputation of averages chosen over two separate inputs was that the three groups of bins represent subsets that are very different in nature, and it was considered wrong to default to the same value, two inputs of zero, for two bins that have the opposite meaning: one containing matching sequences and the other containing mismatching sequences. Especially when in at least one of those two bins, a default value would be significantly, yet accidentally different from the values commonly seen by the ML model, it was thus considered safest to provide these common average values in places where the value was a total unknown.
Since the models are empirical in nature and judged by their final performance, shortcomings in this approach would be observed in that performance. On the one hand, this means that there is not a need for a strict verification that the particular choice of imputation is appropriate. On the other hand, it provides one potential place where a significant improvement to the model might be possible by significant adjustment. The potential alternatives include two nominator and denominator inputs; inclusion of the evaluated sequence in the counts so that the denominator is never zero; and a different choice of imputation.
Evaluation of these options was not attempted in the work preceding this paper, but they provide one potential avenue to increase accuracy to the point where an ML-based model can be used on its own.
It should be noted that one of the sources for missing values was the exceptionally high variation found in metagenomics, which provided both highly unusual values and large gaps caused by these mismatches where there would be effectively no sequence data from the remaining sequences at all. This was rarely true for the hexaploid wheat genome samples; hence, imputation was used much less there, which may be one of the reasons for the significantly better results observed for wheat.
A note should be made about the reason for the use of only a subset of the reads to compute the averages for imputation. Should all individual bases be turned into examples to be passed through the learning model for evaluation, there would be no performance penalty in computing all averages, as all bases would be fully evaluated anyway. However, training only uses a random subset of sequences, and the additional filtering with a weighted frequency to select good error candidates presented in
Section 4.3 only works with a small number of bases, requiring the full computation of the input values for bases that would never be evaluated by the ML-based model, which can take a significantly long time.
5. Results
To verify the chosen design of the ML input examples, as well as the non-standard way to craft training examples, models constructed using them were tested on metagenomics data and, then, through a later experiment, on hexaploid wheat data.
5.2. Varying the ANN Model Parameters
The results given in
Table 1 were achieved after an initial selection of the ANN model parameters: 30 input neurons (corresponding to 10 pairs of nucleotide bases neighboring the evaluated one) and 16 hidden neurons. This seems to be a reasonable choice: a radius of 10 bases gives enough information about the local variation to the ANN model, which will allow the neural network to determine which part of the datasets that corroborates the evaluated nucleotide is made of relevant sequences. A number of hidden neurons that is approximately half the number of input neurons is a common initial choice, as well. However, ANN models are first and foremost a heuristic and empirical tool, where the results are dependent on experiments, not on some hard set rules, so some experiments supporting the choice of parameters or exploring other potential values have to be conducted.
In our experiment, we first varied the radius of the window around the evaluated base to determine if 10 pairs of nucleotide bases give enough local context to the ANN model, thus varying the number of input neurons away from the initial 30.
Table 2 shows the results when radii of 1–6 (3–18 input neurons) were used. In addition, tests were made with only the outer 2, 4, and 6 pairs of bases inside a 10-base window to test if the closest bases provided the most relevant context.
It is evident that increasing the number of input neurons improved the detection accuracy, particularly the sensitivity. The specificity was also affected. With a radius of one base, or only three input neurons corresponding to the two adjacent bases, the accuracy was substantially weakened; the sensitivity did not become reasonable until at least two base-radii, where six input neurons summarized information about the four neighboring nucleotide bases. Beyond six bases, there was saturation of accuracy, as adding more bases did not lead to a dramatic improvement, demonstrating that inputting 18 neurons is enough.
This is only valid when the dataset-targeted would be used, as the result on a different biological dataset was highly inconsistent. One can presume the possibility of overfitting the model towards the biological features in the training set, a problem that seems less pronounced when a dataset-targeted training is to be used.
It should also be noted that using the data for only distant bases led to a significant and consistent decrease in the accuracy, both on a split of the training set and on a different biological dataset. This confirms the expectations that the nearby bases are the most important and also suggests that the nearest four bases may be enough, but are also necessary to make an accurate determination using this ANN model.
Table 3 shows the effect of varying the number of neurons on the
hidden layer. For this experiment, the input neurons were fixed to 30, and the hidden neurons were increased and decreased from 16 to see if that would affect the result positively or negatively.
The variation of the number of neurons shows that 16 can be regarded as an optimal number of hidden neurons. The varying of their number lead to a decrease in accuracy, and in seven out of eight cases, 16 hidden neurons were better than 15 or 17 hidden neurons. We would not consider these empirically-obtained numbers as a definite mark for successful analysis, but in this particular case, it supported the assumption that 16 hidden neurons was enough to postpone modifications of the hidden layer as a means to improve the results.
5.4. Application of Other ML-Based Models
During the development of a threshold-based analytical approach that was presented in [
23], it was noted that thresholds may be a good way to separate the signal (correct bases) from the noise (errors). Since decision tree models over numerical data rely on the use of empirically-determined thresholds, this inspired us to test a decision tree algorithm and then subsequently test 20 different ML-based models, focusing on the algorithms that use trees. The models were selected among those available in the Weka [
26] AI system using default parameters, and all the tests were also conducted using it.
Table 5 shows the results with all the tree-based models that were selected in the Weka software, as well as other classification models like RIPPER. The table is sorted by
specificity on a different biological dataset, in an effort to select the model that was best suited to be used directly to detect the errors, without the need to preselect error candidates, by finding the highest possible specificity.
The observed results confirmed the expectations that decision trees are also a good tool to approach this problem. Half of the decision tree algorithms produced specificity higher than the ANN when used on a different set of biological data.
The random forest model achieved the highest accuracy overall, with a specificity of over 99.9%. This means less than one falsely-selected error for every 1000 bases. This suggests RF is the best model to be used directly to detect errors in the biological data, without the use of further tools to improve the accuracy. On the contrary, from the results in
Section 5.5, we would learn that the RF model became an ill-suited choice when it
was combined with error candidate pre-selection.
To allow further comparison, when trained on a separate test set, the 60-tree RF model’s F score was and the balanced Matthews correlation coefficient (MCC) , and the ROC Area Under Curve (AUC) reported by Weka exceeded . For a data split of the training set, the F score increased to and the MCC to . Likewise, for the ANN model, the F score on the test set was , the MCC , and the computed AUC . Using the data split of the training set, the F score increased to , and the MCC increased to .
The RF model consistently showed higher F, MCC, and ROC area under curve (AUC) than all the other models in either the test set or a split of the training set, except for RIPPER rule learner. On a separate test set, RIPPER performed with an F score of and MCC of , the highest of all the models. Conversely, the ANN model was outperformed by the majority of the tree models.
The F
score, the MCC, and the ROC area under curve all represent measures that take into account both false positives and false negatives, giving an equal importance to both types of erroneous classification. However, because of the rarity of errors, the error specificity plays a much more important role in this particular task, as the cost of a false positive is significantly higher. This is why
Table 5 is ordered by the specificity instead.
When choosing the most suitable models for the task of error detection, as well as further evaluation, a couple of factors were considered. Several of the tree models, as well as the RIPPER rule learner showed the highest overall accuracy, suggesting them as the most suitable pick for the task altogether. Between RIPPER and RF, RIPPER had the absolute highest accuracy, but RF with 60 trees had the highest sensitivity, which—to avoid as many as possible costly false positives—was selected with higher priority for further testing. It should additionally be noted that if these two models are to be used directly, without a filter like the one proposed in
Section 4.3, they seem to be the most suitable choice.
On the other hand, if the filter from
Section 4.3 is applied to further improve the specificity, the combination may alter the accuracy ranking of the models, because of interactions between the filter and the models. Additionally, the accuracies were measured on error examples at arbitrary positions, not on a test set constructed from real errors, nor errors simulated with a realistic statistical profile matching the one of expected real errors. This may alter the ranking of the models in regards to their
sensitivity, as real errors may, on average, be more or less likely to be detected than arbitrarily-placed ones.
With that in mind, the ANN model was also selected to be tested with a weighted frequency filter as described in
Section 4 and was additionally tested using a probabilistic simulation of errors that matched their expected occurrence in
Section 5.5. Artificial neural networks have two attractive qualities that support that choice: they are significantly different from the tree models, and unlike trees or rules, they provide a model with a complex continuous non-linear relationship between the inputs and the prediction. Not only would this allow us to test if the ranking is preserved when a filter is added, but the complex nature of the ANN model may decrease the likelihood of its selection overlapping with the rather trivial selection of examples proposed in
Section 4.1, allowing the conjunction of the two classifiers to have higher overall accuracy.
Author Contributions
Conceptualization, D.V. and M.N.; methodology, M.K. and D.V.; software, M.K.; validation, M.K., D.V. and M.N.; formal analysis, M.K.; investigation, M.K. and D.V.; resources, D.V. and M.N.; data curation, D.V. and M.K.; writing–original draft preparation, M.K.; writing–review and editing, M.N., D.V. and M.K.; visualization, M.K.; supervision, M.N. and D.V.; project administration, M.N. and D.V.; funding acquisition, M.N. and D.V.
Funding
M.K. and D.V. acknowledge financial support by the Bulgarian NSF within the “GloBIG: A Model of Integration of Cloud Framework for Hybrid Massive Parallelism and its Application for Analysis and Automated Semantic Enhancement of Big Heterogeneous Data Collections” Project, Contract DN02/9/2016. M.N. acknowledges financial support by the European Regional Development Fund and the Operational Program “Science and Education for Smart Growth” under Contract No. BG05M2OP001-1.001-0004-C01 (Project “Universities for Science, Informatics and Technologies in the e-Society (UNITe)”).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Nelson, K.; White, B. Metagenomics and Its Applications to the Study of the Human Microbiome. In Metagenomics: Theory, Methods and Applications; Horizon Scientific Press: Poole, UK, 2010; pp. 171–182. [Google Scholar]
- The MetaSUB International Consortium. The Metagenomics and Metadesign of the Subways and Urban Biomes (MetaSUB) International Consortium inaugural meeting report. Microbiome 2016, 4, 24. [Google Scholar] [CrossRef] [PubMed]
- Kristensen, D.; Mushegian, A.; Dolja, V.; Koonin, E. New dimensions of the virus world discovered through metagenomics. Trends Microbiol. 2010, 18, 11–19. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Allen-Vercoe, E.; Petrof, E.O. The microbiome: What it means for medicine. Br. J. Gen. Pract. 2014, 64, 118–119. [Google Scholar] [CrossRef] [PubMed]
- Kau, A.L.; Ahern, P.P.; Griffin, N.W.; Goodman, A.L.; Gordon, J.I. Human nutrition, the gut microbiome, and immune system: Envisioning the future. Nature 2011, 474, 327–336. [Google Scholar] [CrossRef] [PubMed]
- Saei, A.A.; Barzegari, A. The microbiome: The forgotten organ of the astronaut’s body–probiotics beyond terrestrial limits. Future Microbiol. 2012, 7, 1037–1046. [Google Scholar] [CrossRef] [PubMed]
- Karlsson, O.E.; Hansen, T.; Knutsson, R.; Löfström, C.; Granberg, F.; Berg, M. Metagenomic Detection Methods in Biopreparedness Outbreak Scenarios. Biosecur. Bioterrorism Biodef. Strategy Pract. Sci. 2013, 11, S146–S157. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Li, R.W. (Ed.) Metagenomics and Its Applications in Agriculture, Biomedicine and Environmental Studies; Nova Science Pub Inc.: Hauppauge, NY, USA, 2010. [Google Scholar]
- Kunin, V.; Engelbrektson, A.; Ochman, H.; Hugenholtz, P. Wrinkles in the rare biosphere: Pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ. Microbiol. 2010, 12, 118–123. [Google Scholar] [CrossRef] [PubMed]
- Valverde, J.; Mellado, R. Analysis of Metagenomic Data Containing High Biodiversity Levels. PLoS ONE 2013, 8, e58118. [Google Scholar] [CrossRef] [PubMed]
- Huse, S.; Huber, J.; Morrison, H.; Sogin, M.; Welch, D. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8, R143. [Google Scholar] [CrossRef] [PubMed]
- Brenchley, R.; Spannagl, M.; Pfeifer, M.; Barker, G.L.; D’Amore, R.; Allen, A.M.; McKenzie, N.; Kramer, M.; Kerhornou, A.; Bolser, D.; et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 2012, 491, 705–710. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Marcussen, T.; Sandve, S.R.; Heier, L.; Spannagl, M.; Pfeifer, M.; The International Wheat Genome Sequencing Consortium; Jakobsen, K.S.; Wulff, B.B.H.; Steuernagel, B.; Mayer, K.F.X.; et al. Ancient hybridizations among the ancestral genomes of bread wheat. Science 2014, 345, 1250092. [Google Scholar] [CrossRef] [PubMed]
- United Nations, Food and Agriculture Organization, S.D.F. Crops /World Total /Wheat /Area Harvested. 2014. Available online: https://web.archive.org/web/20150906230329/http://faostat.fao.org/site/567/DesktopDefault.aspx?PageID=567 (accessed on 6 September 2015).
- Rojas, R. Neural Networks: A Systematic Introduction; Springer: Berlin, Germany, 1996. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
- Qi, Y. Random Forest for Bioinformatics. In Ensemble Machine Learning; Zhang, C., Ma, Y., Eds.; Springer: Boston, MA, USA, 2012; pp. 307–323. [Google Scholar] [Green Version]
- Krachunov, M.; Nisheva, M.; Vassilev, D. Machine Learning-Driven Noise Separation in High Variation Genomics Sequencing Datasets. In Proceedings of the Artificial Intelligence: Methodology, Systems, and Applications (AIMSA 2018), Varna, Bulgaria, 12–14 September 2018; Agre, G., van Genabith, J., Declerck, T., Eds.; Springer: Cham, Switzerland, 2018; Volume 11089. [Google Scholar] [CrossRef]
- Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659. [Google Scholar] [CrossRef] [PubMed]
- Katoh, K.; Kuma, K.; Toh, H.; Miyata, T. MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleid Acid Res. 2005, 33, 511–518. [Google Scholar] [CrossRef] [PubMed]
- Miller, J.R.; Koren, S.; Sutton, G. Assembly Algorithms for Next-Generation Sequencing Data. Genomics 2010, 95, 315–327. [Google Scholar] [CrossRef] [PubMed]
- Gilles, A.; Meglécz, E.; Pech, N.; Ferreira, S.; Malausa, T.; Martin, J.F. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genom. 2011, 12, 245. [Google Scholar] [CrossRef] [PubMed]
- Krachunov, M.; Vassilev, D. An approach to a metagenomic data processing workflow. J. Comput. Sci. 2014, 5, 357–362. [Google Scholar] [CrossRef]
- Krachunov, M.; Nisheva, M.; Vassilev, D. Machine learning models in error and variant detection high-variation high-throughput sequencing datasets. Procedia Comput. Sci. 2017, 108C, 1145–1154. [Google Scholar] [CrossRef]
- Laver, T.; Harrison, J.W.; O’Neill, P.; Moore, K.; Farbos, A.; Paszkiewicz, K.; Studholme, D.J. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detect. Quantif. 2015, 3, 1–8. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Witten, I.H.; Frank, E.; Hal, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2011. [Google Scholar]
- Hulten, G.; Spencer, L.; Domingos, P. Mining time-changing data streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; ACM Press: New York, NY, USA, 2001; pp. 97–106. [Google Scholar] [Green Version]
- Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. J. Am. Stat. Assoc. 1963, 58, 13–30. [Google Scholar] [CrossRef] [Green Version]
- Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; Morgan Kaufmann: Burlington, MA, USA, 1995; pp. 115–123. [Google Scholar] [Green Version]
- Kirov, K.; Krachunov, M.; Kulev, O.; Nisheva, M.; Vassilev, D. Reducing false negatives for errors in SNP detection using a machine learning approach. Comptes Rendus de l’Académie Bulgare des Sciences 2016, 69, 155–160. [Google Scholar]
- Schröder, J.; Schröder, H.; Puglisi, S.J.; Sinha, R.; Schmidt, B. SHREC: A short-read error correction method. Bioinformatics 2009, 25, 2157–2163. [Google Scholar] [CrossRef] [PubMed]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).