1. Introduction
The research and development of data compression methods based on stochastic processes consider the frequencies of occurrence of the elements of the underlying alphabet to produce a code. Data compression methods are designed to transmit and store data, see [1]. The Huffman coding method is an icon in this field, since it is optimal for compressing genetic data, text files, etc. Due to the speed with which knowledge about DNA-related structures evolves, the volume of information has grown dramatically, making it impossible to cover all the information on relevant phenomena without a strategy that allows compressing the large number of sequences currently available in databases, such as the free source of The National Center for Biotechnology Information (NCBI, accessed on 24 December 2021, https://www.ncbi.nlm.nih.gov/).
The data compression process built by Huffman coding incorporates the use of probabilities to produce an optimal code. This article seeks to improve the Huffman code’s performance with respect to (i) how many probabilities are needed and (ii) how the probabilities are estimated. Obtaining an efficient compression method requires a deep understanding of the behavior of the database. Then, the representation that promotes compression must be generated from an appropriate stochastic model that accurately represents the database. Such notions run in parallel, since they appeal to the minimum information necessary to retrieve a message or to the minimum description of a model necessary to describe the behavior of a database. While in the compression field the goal is to use a reduced number of bits to encode the message, in modeling, the goal is to identify an accurate model for the representation of the message. Then, we can think about the premise of identifying an optimal method for compression that uses an optimal stochastic model to produce the data compression. A series of aspects arises when classifying a stochastic model as optimal: plausibility, complexity, etc. In this paper, we consider these aspects and treat them from the perspective of model selection and data compression. This problem is also related to the Minimum Description Length principle of model selection (see [2]). This principle establishes that the best model for a given dataset is the one that produces the shortest encoding when considering the encoding of the model plus the encoding of the dataset using the model.
A message, a DNA sequence, among other data structures, can be seen as a realization of a stochastic process, subject to a certain alphabet and with a certain memory. For example, in [3,4], it is shown that the genetic structures of SARS-CoV-2, in FASTA format, can be modeled as a Markovian process. In this case, a model of partitions with interstice was used, which is a variant of the model initially proposed in [5] (and investigated in [6]). In those cases, the alphabet is composed of the four bases, $A = \{a, c, g, t\}$, where the bases are adenine ($a$), thymine ($t$), guanine ($g$), and cytosine ($c$). The dependence structure that exists in genetic sequences receives a description that is given by (i) a representation of the state space, by identifying genetic units, where these units are groups of states that compound a partition of the whole state space, and by (ii) the set of transition probabilities from each genetic unit to each base of the alphabet $A$.
Throughout this paper, we assume that the true model generating the message is unknown and that it will be represented by a Markov chain. According to the Minimum Description Length principle, it is necessary to identify a parsimonious model to represent the Markov chain. The goal of this paper is to obtain an improvement in the efficiency of Huffman coding for Markov chains. Huffman’s algorithm needs to build a code for each transition probability function of the Markov process. The Partition Markov models ([6]) reduce the total number of transition probabilities needed to describe the Markov process. Then, using the partition Markov models together with the Huffman code produces a more efficient description of the model. The motivation is that we can consistently estimate the Partition Markov model to efficiently describe the message generation mechanism. We also seek an estimate of the minimum expected codeword length per symbol, which will allow us to validate the combined use of Huffman coding and partition Markov models as a cost-effective strategy.
The content of this article is as follows:
Section 2 introduces the concepts and classification of coding systems.
Section 3 presents the main notions and the consistent estimation of the stochastic strategy that we propose as an option to improve the performance of the Huffman coding process. In addition, in this section, we show results establishing the relation between an optimal coding system and the estimation of a stochastic process.
Section 4 describes the stages of a message transmission process under the new strategy introduced in the previous section.
Section 5 shows the advantages and disadvantages of the new strategy, compared with the application of the traditional Huffman code, and concludes with the application of the compression system to three SARS-CoV-2 sequences. The conclusions and details of the contributions of this paper follow in
Section 6.
2. Theoretical Background and Optimal Codes
Among the qualities expected of a coding method, we note that an economical length for the coded message and an instantaneous method for decoding are desired, so that it is not necessary to receive the entire message to start the decoding process. Another aspect of coding is the complexity of the elements used for such a message since, if the complexity is high, the procedure can demand high computational and memory resources. Coding theory, see [7], organizes the problem by giving specific attributes to the encoding function, which are introduced below, associated with a random variable, and then easily extended to stochastic processes.
Definition 1. Let $X$ be a random variable taking values in a finite alphabet $A$, and let $\{0,1\}^*$ be the set of finite concatenations of elements of $\{0,1\}$. The source code of $X$, denoted by $C$, is a function $C: A \to \{0,1\}^*$ such that, for each $x \in A$, $C(x)$ is the codeword of $x$. In addition, $l(x)$ denotes the length of the codeword $C(x)$.
Definition 2. Let $X_1^b = (X_1, \ldots, X_b)$ be a size $b$ random vector taking values in $A^b$. A block source code of $X_1^b$ (with block size $b$), denoted by $C_b$, is a function $C_b: A^b \to \{0,1\}^*$ such that, for each $x_1^b \in A^b$, $C_b(x_1^b)$ is the codeword of $x_1^b$, and $l(x_1^b)$ is the codeword length for the block $x_1^b$.
For simplicity, in this section, all the properties and algorithms will be given for a source code of a random variable. However, they are valid or trivially extended to the block source code for a random vector as in Definition 2.
For the case of binary codewords (see Definition 1), $l(x)$ records the number of bits used to represent $x$. Since $X$ is a random variable, it is possible to determine the expected length of the code $C$. Moreover, since we wish to use the fewest possible number of elements to encode a message, it is necessary to adopt a criterion, given by the next definition.
Definition 3. The expected length of the source code $C$ for a random variable $X$ with probability mass function $p$ is given by $E(l) = \sum_{x \in A} p(x)\,l(x)$. As expected, we want a code that provides a minimum expected length $E(l)$.
Necessarily, a code $C$ that does not lead to ambiguity is one that verifies the following definition, and such a code is called nonsingular.
Definition 4. A source code $C$, under Definition 1, is a nonsingular code if, given $x \neq x'$, $x, x' \in A$, then $C(x) \neq C(x')$.
Sending sequences of realizations of $X$ would imply imposing certain conditions on a code that can be composed by the concatenation of codes of values of a random variable.
Definition 5. Given a source code $C$ under Definition 1, and given a sequence of realizations of $X$, say $x_1, x_2, \ldots, x_n$, the extension of $C$ on the sequence is $C(x_1 x_2 \cdots x_n) = C(x_1)C(x_2)\cdots C(x_n)$, being the second term the concatenation of the codewords $C(x_1), \ldots, C(x_n)$.
Under Definition 5, we can state the conditions required to consider a source code as ideal.
Definition 6. Under Definition 1, the code $C$ is said to be uniquely decodable if its extension, given by Definition 5, is nonsingular (see Definition 4).
A code verifying Definition 6 may seem ideal, but the constraint is that such a code is not necessarily instantaneous. Then, it can be necessary to read the whole codified message before proceeding to decode it. This brings us to the notion of a prefix (or instantaneous) code.
Definition 7. Under Definition 1, the code $C$ is said to be a prefix code if no codeword is a prefix of any other codeword.
According to the various classifications attributed to source codes, the prefix codes form the strictest class (see [7]); in particular, it is a class that verifies Definition 6. An example is the Huffman code, as shown in [7]. Furthermore, the prefix codes are related to Kraft’s inequality, via Theorem 5.2.1 (see [7]). A prefix code (under Definition 1) is such that
$$\sum_{x \in A} 2^{-l(x)} \leq 1.$$
That is, there is a conditioning on the possible values of $l(x)$, $x \in A$.
Thus, Kraft’s inequality characterizes the prefix condition. Finding the ideal code (which must be a prefix code) is related to the minimization of
$$E(l) = \sum_{x \in A} p(x)\,l(x)$$
under the constraint
$$\sum_{x \in A} 2^{-l(x)} \leq 1;$$
then, the solution is given by
$$l^*(x) = -\log_2\big(p(x)\big),$$
which implies a possibly non-integer solution; thereby, the optimal size will be approximated. The first approximation is to consider, for each codeword, the smallest integer greater than or equal to $-\log_2(p(x))$, say
$$l(x) = \lceil -\log_2\big(p(x)\big) \rceil. \quad (3)$$
Consider the entropy of $X$,
$$H(X) = -\sum_{x \in A} p(x)\log_2\big(p(x)\big).$$
If $E(l)$ is the expected length of a code following Equation (3), it verifies Kraft’s inequality, guaranteeing $H(X) \leq E(l) < H(X) + 1$, and Theorem 5.2.1 (from [7]) ensures that there is an instantaneous code with these codeword lengths.
Any instantaneous code of expected length $E(l)$ satisfies $E(l) \geq H(X)$, as demonstrated by Theorem 5.3.1 (from [7]), including the optimal code, with expected length $L^*$, which, in the set of instantaneous codes, verifies $L^* \leq E(l)$ for all $E(l)$ coming from an instantaneous code, while, by Theorem 5.4.1 (from [7]),
$$H(X) \leq L^* < H(X) + 1, \quad (5)$$
where $L^*$ is the expected length of the optimal instantaneous code.
For a given distribution, an optimal prefix source code can be constructed using the Huffman algorithm, reproduced here (Algorithm 1); see also [8]. For simplicity, the algorithm is presented for a variable taking values from 1 to $m$, with $p(1) \geq p(2) \geq \cdots \geq p(m)$; then, the code is composed by concatenation of elements in $\{0,1\}$; those impositions can be easily eliminated in practice.
Algorithm 1: Huffman Code
Input: the probabilities $p(1) \geq p(2) \geq \cdots \geq p(m)$ of the $m = |A|$ values of $X$.
1 If $m = 2$, set $C(1) = 0$ and $C(2) = 1$.
2 Otherwise, while $m > 2$:
– define the reduced list $p'$, with $m - 1$ entries:
∗ $p'(i) = p(i)$, for $i = 1, \ldots, m - 2$;
∗ $p'(m - 1) = p(m - 1) + p(m)$;
∗ reorder $p'$ in non-increasing order, and let $j$ be the position occupied by the merged probability;
– If $C'$ denotes the code built for $p'$ and $i \leq m - 2$, set $C(i) = C'(i')$, where $i'$ is the position of $p(i)$ in $p'$;
– Otherwise, if $i = m - 1$, set $C(i) = C'(j)0$, and if $i = m$, set $C(i) = C'(j)1$;
– Set $m = m - 1$;
– Define $p = p'$ and return to 1.
Output: the codewords $C(1), C(2), \ldots, C(m)$.
In Algorithm 1, $|A|$ denotes the cardinal of the set $A$. The canonical Huffman code (CHC) is a modification of the Huffman code (Algorithm 1), such that the only information needed to decode it is the set of codelengths.
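Before turning to the canonical variant, the following is a minimal, illustrative sketch of the standard merge-two-smallest construction that Algorithm 1 reproduces (it is not the authors' implementation; the function name `huffman_code`, the dictionary input, and the five example probabilities are assumptions made for the sake of the example):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability}.

    Repeatedly merges the two smallest probabilities, prepending one bit
    to every codeword of each merged group, as in Algorithm 1.
    Returns a dict {symbol: codeword string}.
    """
    if len(probs) == 1:                       # degenerate case: a single symbol
        return {next(iter(probs)): "0"}
    tie = count()                             # tie-breaker, so heapq never compares dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # smallest probability
        p2, _, group2 = heapq.heappop(heap)   # second smallest
        merged = {s: "0" + w for s, w in group1.items()}
        merged.update({s: "1" + w for s, w in group2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

# Arbitrary five-symbol example (not the probabilities of Example 1)
code = huffman_code({"a": 0.4, "b": 0.3, "c": 0.15, "d": 0.1, "e": 0.05})
codelengths = {s: len(w) for s, w in code.items()}   # input for the canonical code
```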
Consider, again, the random variable $X$ taking values on the alphabet $A$, with $|A| = m$. Assume that the probabilities are non-increasing, such that, if $x < x'$, then $p(x) \geq p(x')$. This assumption can be made without loss of generality because, if it is not true for $X$, there is a permutation of the values of $X$, say $Y$, such that $Y$ has non-increasing probabilities.
Let $C$ be a Huffman code for $X$, with codelengths $l(1) \leq l(2) \leq \cdots \leq l(m)$. Calling $C^{ca}$ the corresponding CHC, $C^{ca}$ is defined from the codelengths in the following way:
- (a)
For $i = 1$, $C^{ca}(1)$ is the concatenation of $l(1)$ zeroes.
- (b)
For $i > 1$, assign as $C^{ca}(i)$ the next (larger) binary number. Do this until the first $i$ such that the codelength increases, $l(i) > l(i-1)$, and then append zeros to the right of $C^{ca}(i)$ until its length equals $l(i)$.
- (c)
Repeat (b) until $i = m$.
For more information about the canonical Huffman code, see Section 5.8 in [
7].
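A minimal sketch of steps (a)–(c), assuming the codelengths are given in non-decreasing order (as produced by a Huffman code under the ordering above); the helper name `canonical_huffman` is an illustrative choice:

```python
def canonical_huffman(lengths):
    """Canonical Huffman codewords from non-decreasing codelengths l(1) <= ... <= l(m)."""
    code = 0                                            # current codeword, as an integer
    prev_len = lengths[0]
    codes = [format(code, "0{}b".format(prev_len))]     # (a): l(1) zeroes
    for l in lengths[1:]:
        code += 1                                       # (b): next (larger) binary number
        if l > prev_len:
            code <<= (l - prev_len)                     # append zeros when the length grows
            prev_len = l
        codes.append(format(code, "0{}b".format(prev_len)))
    return codes

# Codelengths 1, 2, 3, 3 give the canonical codewords 0, 10, 110, 111
print(canonical_huffman([1, 2, 3, 3]))
```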
The following example illustrates the stages followed for the construction of the codes.
Example 1. Consider $X$ a random variable with five values, listed in the alphabet, with the probabilities given in Table 1. We describe below the stages (i–iv) through which the coding process goes (using Algorithm 1).
- i.–iii.
At each stage, the two smallest probabilities of the current list are merged, and the resulting list is reordered in non-increasing order.
- iv.
When only two probabilities remain, the codewords 0 and 1 are assigned and, going back through the merging stages, the codeword of each merged probability is expanded by appending 0 and 1 for the two probabilities that were merged.
Obtaining the Huffman code $C$ for $X$ reported in Table 1.
To find the canonical Huffman code, we need to re-code $X$ (column 1 of Table 1) so that the probabilities are non-increasing in lexicographic order, obtaining $Y$. The Huffman code for $Y$ and the corresponding codelengths are obtained directly from $C$ (see columns 5 and 6 of Table 1). From the codelengths of $Y$, we obtain the CHC, see column 7 of Table 1. Observe that the Huffman code and the canonical Huffman code share the same codelengths. The algorithm strategy is optimal since, according to Theorem 5.8.1 (see [7]), if $C$ is the Huffman code, or the CHC, and $C'$ is another code that verifies Definition 6 (uniquely decodable), then $E(l) \leq E(l')$, where $l$ and $l'$ are the lengths of $C$ and $C'$, respectively. That is, $C$ offers the smallest expected size in the class of decodable codes. This means that elements less used in the encoding of the message occupy a greater number of bits than those used more by the message. It should be noted that optimality is not a guarantee of uniqueness; what Theorem 5.8.1 of [7] tells us is the first condition and not the second one.
In order to extend the idea to stochastic processes, we first introduce the appropriate notation and, after that, we adapt the coding process. Let $(X_t)$ be a discrete time order $o$ Markov chain on a finite alphabet $A$, with $S = A^o$ the state space of the process. Denote the string $x_m, x_{m+1}, \ldots, x_n$ by $x_m^n$, where $x_i \in A$, $m \leq i \leq n$. In this context, the elements that define the process are the transition probabilities: for each $a \in A$ and $s \in S$, define
$$P(a|s) = \mathrm{Prob}\big(X_t = a \mid X_{t-o}^{t-1} = s\big).$$
Then, we have a set of probabilities to be identified, say $\{P(a|s)\colon a \in A,\ s \in S\}$.
According to this new context, it is necessary to implement the coding for each state $s \in S$, as we show in the next example.
Example 2. Consider a process with memory $o$ and alphabet $A$, with transition probabilities given by Table 2 (left). Suppose that the message to be coded is ‘’. As the process has memory $o$, we do not have (transition) probabilities to use the canonical Huffman code on the first $o$ elements. The first $o$ elements are codified with the binary number corresponding to their lexicographic positions in the alphabet. In this case, the first letter in the sequence is “c”, which corresponds to the third letter in the alphabet; as a result, the first element in the sequence has code 11, corresponding to 3 in binary.
The result is the coded string, obtained by reading the message from left to right: c is coded by 11 (3 in binary); c is followed by a, and, using Table 2, a (given c) is coded by 10; a is followed by a, and then a (given a) is coded by 11, etc. To decode the sequence, we need to consider the order $o$; then, the first element is decoded as c (third letter of the alphabet); now, we use the conditional line associated with c given in Table 2, so, if the symbol to decode has code 10 and previously we had c, then the original letter in the second position was a. In this stage, we use the conditional line associated with a; then, because the next symbol is 11, the letter with that conditional code is a, and so on. The example shows that the instantaneous coding is applicable to each conditional space of the process. That is, the process’ code is a collection of conditional codes, $\{C(\cdot|s)\colon s \in S\}$. In addition, $\{l(\cdot|s)\colon s \in S\}$ is the collection of conditional lengths for the set of codes. That is, to identify the conditional codes $C(\cdot|s)$, we consider the transition probabilities $P(\cdot|s)$ and, proceeding as in the previous example, we obtain the corresponding codewords and codelengths. Furthermore, this example shows how the initial part of the message, which has not yet reached the minimum memory of the process, can be treated. There are other ways to deal with the initial elements of a message, not necessarily like the one shown in the example, but such a question does not impact the contributions of the article. The procedure encodes one element at a time, with an optimal code depending on the past. It can easily be extended to encode $b$ consecutive elements of the sample at a time, with an optimal Huffman code depending on the past. To calculate the conditional Huffman code for a block of $b$ symbols, we first need to calculate the conditional transition probabilities for sequences of $b$ letters of the alphabet.
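Before turning to the block probabilities, the following is a minimal sketch of the one-element-at-a-time scheme of Example 2: one Huffman code per state is built from that state's row of transition probabilities, and each symbol is encoded with the code of the state formed by the $o$ preceding symbols. It reuses the `huffman_code` sketch given after Algorithm 1; the treatment of the first $o$ symbols (their 1-based lexicographic position written with a fixed number of bits) and all names are illustrative assumptions:

```python
from math import ceil, log2

def encode_markov(message, alphabet, trans_probs, o):
    """Encode `message` (a string over `alphabet`) with one Huffman code per state.

    `trans_probs[s]` is the dict {letter: P(letter|s)} for each length-o state s.
    """
    cond_code = {s: huffman_code(p) for s, p in trans_probs.items()}  # one code per state
    width = ceil(log2(len(alphabet) + 1))       # enough bits for positions 1..|A|
    out = []
    for letter in message[:o]:                  # no past available yet: send the
        pos = alphabet.index(letter) + 1        # lexicographic position in binary
        out.append(format(pos, "0{}b".format(width)))
    for t in range(o, len(message)):
        s = message[t - o:t]                    # state = the o previous letters
        out.append(cond_code[s][message[t]])    # conditional codeword of the next letter
    return "".join(out)
```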
Let $(X_t)$ be a discrete time order $o$ Markov chain on a finite alphabet $A$ and state space $S = A^o$, with transition probabilities $P(a|s)$, $a \in A$, $s \in S$. Consider now $P(x_1^b|s)$, the transition probability for the block $x_1^b \in A^b$ given the state $s \in S$. This probability can be calculated from the set of transition probabilities $\{P(a|s)\colon a \in A,\ s \in S\}$ in the following way:
$$P\big(x_1^b \mid s\big) = \prod_{j=1}^{b} P\Big(x_j \mid (s\, x_1^{j-1})_{j}^{j+o-1}\Big),$$
where $(s\, x_1^{j-1})_{j}^{j+o-1}$ is the concatenation of the elements in positions $j, \ldots, j+o-1$ of the sequence resulting from the concatenation of $s$ and $x_1^{j-1}$. This is done for each $x_1^b \in A^b$ to obtain $\{P(x_1^b|s)\colon x_1^b \in A^b\}$. After that, the set of transition probabilities for size $b$ blocks is used to calculate $C_b(\cdot|s)$, the conditional block Huffman code for the state $s$, for each $s \in S$, applying the Huffman algorithm to the set of probabilities $\{P(x_1^b|s)\colon x_1^b \in A^b\}$.
The Huffman code for blocks is relevant since Theorem 5.4.2 ([7]) claims that the expected code length per symbol, for a size $b$ block Huffman code, converges to the entropy of the process, which is the theoretical minimal code length per symbol. For more information, see [9,10,11] and Section 5 in [7]. In our simulations, we will use blocks of size 4 to encode the data.
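As a sketch of this computation (names and data layout are illustrative assumptions), the block distribution for each state is a product of one-step transition probabilities along a sliding window of length $o$, and its output can be fed to the `huffman_code` sketch above to obtain the conditional block codes:

```python
from itertools import product

def block_transition_probs(trans_probs, alphabet, o, b):
    """Return {s: {block: P(block|s)}} for every length-b block over `alphabet`,
    where trans_probs[s][a] = P(a|s) and the states s are length-o strings."""
    blocks = ["".join(t) for t in product(alphabet, repeat=b)]
    out = {}
    for s in trans_probs:
        dist = {}
        for block in blocks:
            p = 1.0
            history = s                           # the last o symbols seen so far
            for a in block:
                p *= trans_probs[history][a]      # P(x_j | positions j..j+o-1 of s x_1^{j-1})
                history = (history + a)[-o:]      # slide the window by one symbol
            dist[block] = p
        out[s] = dist
    return out

# Conditional block Huffman code for state s:
#   C_b_s = huffman_code(block_transition_probs(P, A, o, b)[s])
```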
In the next section, we present the strategy for estimating transition probabilities, in a real message or data set. We also present the Partition Markov Model (PMM) that allows us to design the proposal for reducing the number of bits in the transmission of a message, using canonical Huffman encoding.
3. Stochastic Estimation Strategy and Optimal Coding
In practice, we need to estimate the transition probabilities; then, let $x_1^n$ be a sample of the process, $s \in S$, and $a \in A$. We denote by $N_n(s, a)$ the number of occurrences of the state $s$ followed by $a$ in the sample $x_1^n$; this is $N_n(s, a) = |\{t\colon o < t \leq n,\ x_{t-o}^{t-1} = s,\ x_t = a\}|$. In addition, the number of occurrences of $s$ in the sample is denoted by $N_n(s)$, with $N_n(s) = \sum_{a \in A} N_n(s, a)$. Thus, for each $a \in A$ and $s \in S$, $\hat{P}(a|s) = \frac{N_n(s, a)}{N_n(s)}$ is the estimator of $P(a|s)$.
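A minimal sketch of this counting step, assuming the sample is a string over the alphabet (the helper name and the dictionary layout are illustrative):

```python
from collections import defaultdict

def transition_counts(sample, o):
    """Compute N_n(s, a) and N_n(s) from a sample string, for memory o."""
    N_sa = defaultdict(int)
    N_s = defaultdict(int)
    for t in range(o, len(sample)):
        s, a = sample[t - o:t], sample[t]       # state of o letters, next letter
        N_sa[(s, a)] += 1
        N_s[s] += 1
    return N_sa, N_s

N_sa, N_s = transition_counts("acgtacgtgacgttac", o=2)
P_hat = {(s, a): c / N_s[s] for (s, a), c in N_sa.items()}   # estimator of P(a|s)
```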
A reliable estimation of probabilities allows a precise coding, since the proposal of Algorithm 1 depends on an adequate quantification of the probabilities to develop the method of attribution of codes in an accurate way. In search of such precision, the idea of partitioning the state space associated with a stochastic process is incorporated. The proposal aims to use different states, all elements of one specific part of that partition, to estimate the same probability. Let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$, and define, for $L \in \mathcal{L}$ and $a \in A$, the probability $P(a|L)$. If $s, s' \in L$, we define $P(a|s) = P(a|s') = P(a|L)$. Then, all the states in $L$ share the same transition probabilities, meaning that we use all these states to estimate the same parameter. However, which is the strategic partition? We formalize the notion in the next definition.
Definition 8. Let $(X_t)$ be a discrete time order $o$ ($o < \infty$) Markov chain on a finite alphabet $A$ and state space $S = A^o$.
- 1.
$s, s' \in S$ are equivalent if $P(a|s) = P(a|s')$ for all $a \in A$.
- 2.
$(X_t)$ is a Markov chain with partition $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ if this partition is the one defined by the relationship introduced by item 1.
This way to represent a stochastic process is called a Partition Markov Model.
Given Definition 8, it is clear that there will be a reduction in the number of probabilities necessary to describe the Markov chain in comparison with the full Markov chain; that is to say, the cardinal $|\{P(a|L)\colon a \in A,\ L \in \mathcal{L}\}|$ is smaller than or equal to the cardinal $|\{P(a|s)\colon a \in A,\ s \in S\}|$. In order to obtain this reduction, it is necessary to identify $\mathcal{L}$, which leads us to the estimation process that is detailed next. Once the structure $\mathcal{L}$ is identified, we can calculate the probabilities of the parts, using them in Algorithm 1 for encoding the message.
In a similar way to the one mentioned at the beginning of this section, the number of occurrences of elements in $L$ followed by $a$ and the total number of occurrences of elements in $L$ are given by $N_n(L, a) = \sum_{s \in L} N_n(s, a)$ and $N_n(L) = \sum_{s \in L} N_n(s)$. Then, the estimation of $P(a|L)$ is given by $\hat{P}(a|L) = \frac{N_n(L, a)}{N_n(L)}$, which shows the benefit, in terms of precision, of calculating such a probability with the occurrences of all the states in $L$.
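A sketch of the pooled estimator, representing the partition as a map from each state to the label of its part (an illustrative choice) and reusing the counts of `transition_counts`:

```python
from collections import defaultdict

def part_counts(N_sa, N_s, part_of):
    """Pool N_n(s, a) and N_n(s) over the parts; part_of[s] is the part containing s."""
    N_La = defaultdict(int)
    N_L = defaultdict(int)
    for (s, a), c in N_sa.items():
        N_La[(part_of[s], a)] += c
    for s, c in N_s.items():
        N_L[part_of[s]] += c
    return N_La, N_L

# Estimator of P(a|L): every state of L contributes its occurrences
# N_La, N_L = part_counts(N_sa, N_s, part_of)
# P_hat_L = {(L, a): c / N_L[L] for (L, a), c in N_La.items()}
```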
The model given by Definition 8 is studied in [6]. The challenge is to find a way to produce a strongly consistent estimation of $\mathcal{L}$. In that paper, the proposal is to use the Bayesian Information Criterion (BIC), see [12], to achieve this goal. The BIC is formulated as follows:
$$\mathrm{BIC}(\mathcal{L}, x_1^n) = \ln\big(ML_{\mathcal{L}}(x_1^n)\big) - \frac{(|A| - 1)|\mathcal{L}|}{2}\ln(n), \quad (6)$$
with $ML_{\mathcal{L}}(x_1^n) = \prod_{L \in \mathcal{L}} \prod_{a \in A} \hat{P}(a|L)^{N_n(L, a)}$ the pseudo maximum likelihood given a partition $\mathcal{L}$, with sample $x_1^n$. The logarithm (ln) of the pseudo maximum likelihood term can also be written as
$$\ln\big(ML_{\mathcal{L}}(x_1^n)\big) = \sum_{L \in \mathcal{L}} \sum_{a \in A} N_n(L, a)\ln\left(\frac{N_n(L, a)}{N_n(L)}\right).$$
It is shown in [6] that, by maximizing the BIC, it is possible to obtain a strongly consistent estimate of the partition $\mathcal{L}$. In practice, this maximization is attainable because of a characteristic of the BIC criterion that allows the definition of a BIC-based metric between candidate parts (see [6]). The metric is used in a clustering algorithm, like the one proposed in [5], to produce a consistent estimation of $\mathcal{L}$. It is relevant to note that the BIC criterion allows the consistent estimation in other models of the area of inference in processes, see [13,14].
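As a sketch (under the formulation of the BIC written above, with natural logarithms), the criterion of a candidate partition can be evaluated directly from the pooled counts; the function name and arguments are illustrative:

```python
from math import log

def bic(N_La, N_L, alphabet_size, n):
    """BIC of a candidate partition, from the pooled counts N_n(L, a) and N_n(L)."""
    log_ml = sum(c * log(c / N_L[L]) for (L, a), c in N_La.items() if c > 0)
    penalty = (alphabet_size - 1) * len(N_L) / 2 * log(n)
    return log_ml - penalty

# The estimated partition maximizes bic(...) over the candidate partitions of S.
```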
Remark 1. The partition estimation problem studied in [6] is beyond the scope of this paper, but it is important to note that the estimation algorithms require computing the frequency ratios $\frac{N_n(s, a)}{N_n(s)}$. The number of frequency ratios should be much smaller than $n$ to have confidence in the calculations. In practice, this restricts the maximal memory $o$ admissible for the model used to compress the data. In Section 5 (situation 6), we see an example of a model too large for the sample size.
Now, and within the framework of stochastic processes, we establish the link between the length of a code and the entropy of the process, as established by Equation (5) for random variables.
The notion of entropy plays a fundamental role in the theory of codes and, as we will show below, it allows us to identify the expected codeword length per symbol of a message. For a stochastic process $(X_t)$, the entropy (rate) is given by
$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n).$$
Using some basic properties of entropy, see [
7], it is possible to prove the following result. The result allows for identifying the entropy of the process, under the partition structure given by Definition 8.
Theorem 1. Let $(X_t)$ be a discrete time order $o$ ($o < \infty$) stationary Markov chain on a finite alphabet $A$ and state space $S = A^o$, with partition $\mathcal{L}$ following Definition 8. Then,
$$H(X) = -\sum_{L \in \mathcal{L}} P(L) \sum_{a \in A} P(a|L)\log\big(P(a|L)\big), \qquad \text{with } P(L) = \sum_{s \in L} P(s).$$
Proof. By Theorem 4.2.1 in [7],
$$H(X) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1) \quad (8)$$
$$= \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_{n-o}) \quad (9)$$
$$= H(X_{o+1} \mid X_1^o) \quad (10)$$
$$= -\sum_{s \in S} P(s) \sum_{a \in A} P(a|s)\log\big(P(a|s)\big) \quad (11)$$
$$= -\sum_{L \in \mathcal{L}} P(L) \sum_{a \in A} P(a|L)\log\big(P(a|L)\big),$$
where the last equality uses Definition 8 and $P(L) = \sum_{s \in L} P(s)$. (8) follows from Theorem 4.2.1 in [7], (9) and (10) are a consequence of the process having order $o$ and being stationary (see Equation (4.25) of [7]), and (11) follows from the definition of conditional entropy. □
Under the assumptions of Theorem 1, and given a sample $x_1^n$ of the process $(X_t)$, we can estimate the entropy of the process by
$$\hat{H}(X) = -\sum_{L \in \hat{\mathcal{L}}} \hat{P}(L) \sum_{a \in A} \hat{P}(a|L)\log\big(\hat{P}(a|L)\big), \quad (12)$$
where $\hat{\mathcal{L}}$ is the estimation of $\mathcal{L}$. Defining $\hat{P}(L) = \frac{N_n(L)}{\sum_{L' \in \hat{\mathcal{L}}} N_n(L')}$ and $\hat{P}(a|L) = \frac{N_n(L, a)}{N_n(L)}$, we obtain, from Theorem 1, that Equation (12) is the plug-in estimator of $H(X)$. That is, the estimation of the partition $\mathcal{L}$, under the effect of the BIC criterion, allows us to identify an estimator of the entropy of the process, which leads us to ask if we can estimate an expected code size. For random variables, there is a relationship between $H(X)$ and the expected length of the optimal code, see Equation (5). For such an adaptation, consider an instantaneous block source code for the random vector $X_1^b$ (Definition 2), with codelength function $l(X_1^b)$. Let $L_b$ be the expected codeword length per symbol,
$$L_b = \frac{1}{b} E\big(l(X_1^b)\big). \quad (14)$$
According to Theorem 5.4.2 ([7]), the minimum expected codeword length per symbol for an optimal block code, $L_b^*$, verifies
$$\frac{H(X_1, \ldots, X_b)}{b} \leq L_b^* < \frac{H(X_1, \ldots, X_b)}{b} + \frac{1}{b},$$
and, for a stationary process, $L_b^* \to H(X)$ as $b \to \infty$. Then, we obtain the next corollary.
Corollary 1. Under the assumptions of Theorem 1, the minimum expected codeword length per symbol satisfies $L_b^* \longrightarrow H(X)$ when $b \to \infty$, where $L_b^*$ is given by Equation (14) and computed for an optimal block source code, and $H(X)$ is given by Theorem 1.
Remark 2. If we consider the estimator of $H(X)$ given by Equation (12), $L_b^*$ can be approximated by $\hat{H}(X)$, which is proportional to the first term of the BIC value of Equation (6), using the estimated partition $\hat{\mathcal{L}}$ and with a negative sign. The compressed representation of the dataset will be composed of two parts—first, a description of the model used to code the data, and second, the coded data. The minimum expected codeword length per symbol refers to the encoding of the data. Remark 2 shows that, for given data, the minimum expected codeword length is smaller for models with larger values of the maximum likelihood.
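A sketch of the plug-in estimator of Equation (12), computed from the pooled counts of the estimated partition; base-2 logarithms are used here so that the value can be read as an approximation, in bits per symbol, of the minimum expected codeword length discussed in Remark 2 (this choice of base, like the names, is an assumption of the sketch):

```python
from math import log2

def entropy_estimate(N_La, N_L):
    """Plug-in estimate of H(X): -sum_L P(L) sum_a P(a|L) log2 P(a|L)."""
    total = sum(N_L.values())                  # total number of counted transitions
    h = 0.0
    for (L, a), c in N_La.items():
        if c > 0:
            p_L = N_L[L] / total               # estimate of P(L)
            p_aL = c / N_L[L]                  # estimate of P(a|L)
            h -= p_L * p_aL * log2(p_aL)
    return h   # approximates the minimum expected codeword length per symbol (in bits)
```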
Remark 3. A model with redundant parameters will require more bits for its description than the correct model. On the other hand, a model with redundant parameters usually has a maximum likelihood value larger than the correct model and, because of Remark 2, will produce a shorter expected code length for the data than the correct model. We will see in Section 5 that using models with redundant parameters for compression produces larger files, because the decrease in the size of the portion of the compressed output corresponding to the encoding of the data is smaller than the increase in the size of the part of the output with the specification of the model.
The strategy that we investigate in this paper is supported by Definition 8 and the BIC criterion, whose tendency is to decrease the number of probabilities necessary for the encoding of a message compared to the use of a full Markov chain. While in the case of a full Markov chain the total number of such probabilities is $|A||S|$, the same does not happen when approaching the problem from the perspective of a PMM, since the total number of probabilities depends on the cardinal of the estimated partition $\hat{\mathcal{L}}$, representing a total equal to $|A||\hat{\mathcal{L}}|$, with $|A||\hat{\mathcal{L}}| \leq |A||S|$. In practice, if a full Markov chain is used, the encoding of the message occurs by Algorithm 1, provided as inputs the estimated probabilities $\{\hat{P}(a|s)\colon a \in A\}$ for each state $s \in S$. However, if a PMM is used, the encoding of the message is given by Algorithm 1, provided as inputs the estimated probabilities $\{\hat{P}(a|L)\colon a \in A\}$ for each part $L \in \hat{\mathcal{L}}$, where $\hat{\mathcal{L}}$ is the estimate of $\mathcal{L}$.
In the next section, the purpose is to describe the process of transmitting a message. The first subsection shows the reader the steps that need to be considered. Later, we present the advantages of using the counts related to a PMM instead of using the counts related to a full Markov process.
4. Codification of the Model Information
In this section, we show how we encode the model and parameters. For simplicity and clarity of exposition, we will assume that the receiver knows the alphabet, and there is no need to encode $A$ in the message to be transmitted. In the case of an order $o$ Markov chain, the model is determined by the order $o$. In the case of a PMM, the model is determined by the partition $\mathcal{L}$ of the state space. In both cases, the parameters of the model are the estimated transition probabilities. In our case, as the transition probabilities are estimated from the message to encode, the parameters will be the transition frequencies: $N_n(s, a)$ in the case of an order $o$ Markov chain, and $N_n(L, a)$ in the case of a PMM. Note that this is not a review of the ways to implement the coding/decoding process, since our goal is to show how we can incorporate and take advantage, in the compression process, of the notion of partition (Definition 8).
4.1. Codification of the Transition Frequencies
To send a coded sequence of numbers, we need a protocol so that the receiver knows where the code of each number starts and finishes. If we send the integer $k$ in binary, the receiver has to know how many bits we are using to encode that number. For example, the integer 5 can be encoded as 101 or as a zero-padded version such as 00101. For any integer $k \geq 1$, the minimal number of bits needed to represent $k$ in binary is equal to $\lfloor \log_2(k) \rfloor + 1$ bits. We know that $N_n(s, a) \leq n$ and $N_n(L, a) \leq n$; this means that each frequency, in any possible model for the data, can be encoded in $\lfloor \log_2(n) \rfloor + 1$ bits. In what follows, all the frequencies will be encoded with a fixed number of bits. For example, to send 30 frequencies, $30(\lfloor \log_2(n) \rfloor + 1)$ bits will be used.
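A minimal sketch of this fixed-width convention (helper names are illustrative):

```python
from math import floor, log2

def bits_per_frequency(n):
    """Number of bits that suffices to write any count between 0 and n in binary."""
    return floor(log2(n)) + 1 if n > 0 else 1

def encode_frequencies(freqs, n):
    """Concatenate the fixed-width binary representations of the counts."""
    w = bits_per_frequency(n)
    return "".join(format(f, "0{}b".format(w)) for f in freqs)

# Sending 30 frequencies from a sample of size n uses 30 * bits_per_frequency(n) bits.
```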
4.2. Codification of the Model
In the case of an order $o$ Markov chain model, the model is determined by the order $o$. We will assume that the order is smaller than 8; this means that it can be written in 3 bits or less. We will reserve 3 bits for the order $o$ of the model.
In the case of a PMM, to send the model, we need to send the partition, that is, the relationship between the states $s \in S$ and the parts $L_1, L_2, \ldots, L_{|\mathcal{L}|}$. In the next example, we show how to do that through the canonical Huffman code.
Now we detail our proposal. If $\hat{\mathcal{L}} = \{L_1, L_2, \ldots, L_{|\hat{\mathcal{L}}|}\}$ is the estimated partition for the state space, such that $S = \cup_{i=1}^{|\hat{\mathcal{L}}|} L_i$, then we associate to each state $s$ an index $i$ if $s \in L_i$. The states are arranged in lexical order, and this list is replaced by the indices previously assigned. As each $L_i$ has a cardinal, which is the total number of states contained in it, say $|L_i|$, each index $i$ is associated with a probability $|L_i|/|S|$, and the canonical Huffman code is applied to these frequencies. Finally, the index list is replaced by the canonical Huffman code list and communicated by a string (say C(Index)) in the transmission process. Example 3 below illustrates the process.
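Before the example, here is a sketch of this index-coding step; it reuses the `huffman_code` and `canonical_huffman` sketches of Section 2, uses the part sizes $|L_i|/|S|$ as the input probabilities, and represents the partition as a map from state to part label (all of these are illustrative choices):

```python
def encode_partition(states, part_of):
    """Return C(Index) and the canonical codeword of each part.

    `states` are listed in lexical order; `part_of[s]` is the index of the part
    containing state s.
    """
    states = sorted(states)                                  # lexical order
    sizes = {}
    for s in states:                                         # |L_i| for each part
        sizes[part_of[s]] = sizes.get(part_of[s], 0) + 1
    probs = {i: c / len(states) for i, c in sizes.items()}   # |L_i| / |S|
    lengths = {i: len(w) for i, w in huffman_code(probs).items()}
    order = sorted(lengths, key=lambda i: (lengths[i], i))   # parts sorted by codelength
    codes = dict(zip(order, canonical_huffman([lengths[i] for i in order])))
    c_index = "".join(codes[part_of[s]] for s in states)     # the string C(Index)
    return c_index, codes

# With the index list 1,2,2,1,1,2,3,1 of Example 3, c_index has 12 bits.
```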
Example 3. Consider a process over an alphabet $A$ with $|A| = 2$, state space $S = A^3$, with order $o = 3$. Suppose that, under Definition 8, the partition $\mathcal{L} = \{L_1, L_2, L_3\}$ is identified, with the composition given by Table 3. Considering the lexical order of the elements of $S$, we associate to each state $s$ the value $i$ if $s \in L_i$.
Then, the list of indices is 12211231 (see Table 4), and each index can be coded by the canonical Huffman algorithm, since we have associated with each index a frequency. See the results reported in Table 5. As a consequence, to indicate to which part each state belongs, we need to codify the sequence 12211231
with the codes in column 4 of Table 5. The string to communicate is given by (15),
$$0\;10\;10\;0\;0\;10\;11\;0, \quad (15)$$
which requires 12 bits. To inform the codification, we need to send the codelengths. In this case, we send the first codelength (1) and then the increments (last column in Table 5). The resulting string is 110, which requires 3 bits. In total, to inform the partition, we need to communicate 15 bits, plus 3 bits for the order. The next example complements the previous one, showing how using a PMM (see Definition 8 and Table 3) can cause a reduction in the number of bits necessary for the transmission of the model.
Example 4. Consider that the partition of Table 3 was generated from a sample of size $n$, with the frequencies represented by Table 6, as the information that generates the partition reported in Example 3. First, suppose we want to transmit the model and parameters using an order 3 Markov chain. To communicate the model, we need to transmit the order using 3 bits. In order to communicate the counts reported in Table 6, we use the strategy previously described. We use a fixed number of bits to represent each $N_n(s, a)$; in this example, the binary representation of each count uses 9 bits, for each $s \in S$ and $a \in A$. Then, the string that is communicated is built from the concatenation, in lexical order, of the representations of each count, fixing first the state and plugging in all the representations following the lexical order of the elements of the alphabet. The resulting string requires $16 \times 9 = 144$ bits. Adding the 3 bits for the order of the model, we get a total of 147 bits. In conclusion, in the case of using an order 3 Markov chain model, the total number of bits required for the specification of the model plus the parameters is 147 bits.
Now, suppose that we want to transmit the model and parameters using a partition Markov model. Considering the partition structure of Table 3 plus the frequencies in Table 6, we have the frequencies in Table 7. Then, under this new scheme, which considers the counts related to the parts, the string that needs to be communicated is the concatenation of the binary representations of the six frequencies in Table 7, written using 9 bits for each $L \in \mathcal{L}$ and $a \in A$. The number of bits that this configuration needs is $6 \times 9 = 54$ bits. From the concatenation of the string 011, corresponding to the order, plus the string (15), corresponding to the codification of the partition, and the string corresponding to the frequencies, we obtain the final string encoding the partition Markov model. This string requires 69 bits.
Table 8 summarizes the results under the two perspectives, using a regular order 3 Markov chain versus the approach by partition. We see then, from the previous example, that the compression produced by the partition approach (Definition 8) allows for reducing some costs if there is a reasonably small number of parts, which allows for reducing the information to be transmitted, in comparison with the model with all the states separated, or with only a slight reduction of the state space.
In the next section, we show simulation studies and an application illustrating the use of Huffman code allied to partition Markov models, and we describe its performance in each situation. We compare the performance of the Huffman coding under two strategies, that is, we compare the performance of the Huffman code when it is implemented using a full order o Markov chain versus using a partition Markov model.
6. Conclusions
In this paper, we propose an improvement in the Huffman coding system’s efficiency, taking into account that the method uses transition probabilities that are computed from the message to be transmitted. A message has a dependence structure in which the element $a$, that is, a letter of the alphabet $A$, occurring in position $t$, depends on the past $o$ values of the message, $s = x_{t-o}^{t-1}$ ($s \in S$, the state space). The Huffman coding system uses the transition probabilities $P(a|s)$, $a \in A$, $s \in S$, to compress the message without loss. To improve the Huffman coding system in a Markov chain setting, we introduce the notion of Partition (see Definition 8, Section 3), which is a parsimonious structure on the state space that allows using several states to identify the same probability $P(a|L)$, where the part $L$ contains several states. The proposal is to reduce the number of bits needed to transmit the frequencies that estimate the transition probabilities associated with the message. Section 4.2 shows and exemplifies the strategy.
On the other hand, there is a potential loss of efficiency due to needing to identify which part $L$ each state $s$ belongs to, and, depending on the context, the number of bits used for such a transmission must be considered. These situations are investigated in Section 5. We conclude that the strategy, via Definition 8, compensates, since a model with redundant parameters will use more bits for the description of the model than the correct one. We report in Section 5 that, for all the situations studied, the partition strategy (by Parts) improves the compression with respect to a full Markov chain (by States). In addition, except for the two cases in which the sample size is similar to the number of transition probabilities that we need to estimate to choose a model, the strategy by Parts has a compression rate significantly larger than 1.
Using Theorem 1, we show that the entropy of a process can be expressed using the structure of a partition. By estimating such a model, we can estimate the entropy of the process generating a message, see Equation (12). A partition Markov model is identified through the Bayesian Information Criterion (see Equation (6)), showing a clear relationship with the minimum expected codeword length per symbol (see Corollary 1), which can be estimated using the Bayesian Information Criterion and the estimated partition $\hat{\mathcal{L}}$, see Remark 2. These results allow us to estimate the minimum expected codeword length per symbol through the estimation process of a partition Markov model.
We complete this article with a compression application in the framework of SARS-CoV-2 sequences (see Section 5.2). We see that the procedure proposed in this article is capable of producing a reduction of more than 10% in relation to the Huffman code applied without the incorporation of the partition Markov model notion. The strategy shows promise in times of need for extra agility in the transmission and modeling of genetic sequences, as is the case of this application.
As a legacy of this article, we can say that the development of new models, such as the Partition Markov model, as well as estimation methods with relevant properties, such as the Bayesian Information Criterion, provides enough substance to contribute to the field of data compression, and this article aims to show this fact.