A consensus string is a significant feature of a deoxyribonucleic acid (DNA) sequence. The median string algorithm is one of the most popular exact algorithms for finding a DNA consensus. A DNA sequence is represented using the alphabet Σ = {a, c, g, t}. To search for a motif of length l, the algorithm generates the set of all 4^l possible motifs, or l-mers, over this alphabet and finds the consensus among them. The algorithm is guaranteed to return the consensus, but the problem is NP-complete and the runtime increases with the l-mer size. Using transition probabilities from a Markov chain, the proposed algorithm symmetrically generates four subsets of l-mers, each containing a few l-mers that start with a particular letter. We use these reduced sets of l-mers instead of all 4^l l-mers. The experimental results show that the proposed algorithm produces far fewer l-mers and takes less time to execute. For l-mers of length 7, the proposed algorithm is 48 times faster than the median string algorithm and produces only 2.5% as many l-mers. Compared with the recently proposed voting algorithm, the proposed algorithm is 4.4 times faster for a longer l-mer size such as 9.
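For orientation, the baseline exhaustive search over all 4^l l-mers can be sketched as follows. This is a minimal illustration of the textbook median string formulation (total Hamming distance to the best-matching window in each sequence), not the authors' implementation. The function names and the toy sequences are assumptions made for this example, and the Markov-chain reduction of the candidate set is not shown.

```python
from itertools import product

ALPHABET = "acgt"  # the DNA alphabet used in the abstract


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(pattern, sequence):
    """Smallest Hamming distance between pattern and any window of sequence.

    Assumes len(sequence) >= len(pattern).
    """
    l = len(pattern)
    return min(
        hamming(pattern, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
    )


def total_distance(pattern, sequences):
    """Sum of the per-sequence minimum distances for one candidate l-mer."""
    return sum(min_distance(pattern, s) for s in sequences)


def median_string(sequences, l):
    """Exhaustively score all 4**l candidate l-mers and return the best one.

    Ties are broken by enumeration order (alphabetical over 'acgt').
    """
    candidates = ("".join(p) for p in product(ALPHABET, repeat=l))
    return min(candidates, key=lambda p: total_distance(p, sequences))


if __name__ == "__main__":
    # Hypothetical toy input: each sequence contains the 4-mer "acgt".
    toy = ["ttacgtaa", "ccacgtgg", "gacgtcca"]
    print(median_string(toy, 4))
```

The candidate generator is the part that grows as 4^l: for l = 7 it yields 4^7 = 16,384 l-mers, each scored against every window of every sequence. A reduced candidate set, such as the one the abstract describes, would replace only the `candidates` expression and leave the scoring unchanged.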