1. Introduction
In the age of Big Data, with datasets being collected in almost all fields of human endeavor, there is an emerging economic and scientific need to extract useful information from them, and machine learning algorithms have become indispensable to this task. One such technique is feature selection [1], which arises from the need to determine the "best" subset of variables for a given problem. An adequate feature selection method can avoid over-fitting and improve model performance, providing faster and more cost-effective learning models as well as deeper insight into the underlying processes that generate the data. Features can be categorized into three types: relevant, irrelevant, and redundant; it is therefore advisable to select the relevant features and discard the irrelevant and redundant ones.
Feature selection is typically performed on machines with a high numerical representation (64 bits). A more powerful processor offers significant benefits in speed and in the capability to solve more complex problems, but this capability does not come without cost: a conventional microprocessor can require a substantial amount of off-chip support hardware and memory, and often a complex operating system. Unlike up-to-date computers, embedded systems, low-energy computers, and integrated solutions that must optimize their hardware resources often cannot meet these requirements. With the power demands of smartphones, health wearables, and fitness trackers, there is a need for tools that enable energy consumption estimation for such systems. Thus, we identify an opportunity to develop a feature selection algorithm for embedded systems without reducing performance. This opportunity leverages the observation that, by simply limiting the number of bits, algorithms can yield parameters whose performance is close to that of optimal double-precision parameters. In this work, we investigate feature selection based on the information theoretic measure of mutual information, computed with reduced-precision parameters; mutual information is chosen for its computational efficiency and simple interpretation. We thus provide a limited bit depth mutual information and, through the minimum Redundancy Maximum Relevance (mRMR) feature selection method, experimentally achieve classification performances close to those of 64-bit representations on several real and synthetic datasets.
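To make this observation concrete, consider the following minimal Python sketch; it is illustrative only, and the function name quantize and the choice of 4 fractional bits are our assumptions rather than part of the method itself. It shows how a double-precision parameter can be limited to a fixed number of fractional bits:

import math

def quantize(x: float, frac_bits: int) -> float:
    # Round x to the nearest multiple of 2**-frac_bits.
    scale = 1 << frac_bits
    return round(x * scale) / scale

log_p = math.log2(0.3)       # -1.7369... in 64-bit double precision
print(quantize(log_p, 4))    # -1.75, the nearest multiple of 1/16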
2. Limited Bit Depth Mutual Information
In information theoretic feature selection, the main challenge is to estimate the mutual information [2]. To calculate mutual information we need to estimate the probability distributions, which are obtained internally by counting the occurrences of values within a particular group. Thus, building on Tschiatschek's work [3] on approximate probability computation, we investigate mutual information with a limited number of bits by computing this measure with reduced-precision counters. For the reduced-precision approach, we target a fixed-point representation instead of the 64-bit resolution.
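As a point of reference, the sketch below shows the standard plug-in estimate of mutual information from occurrence counts, which is the quantity that our reduced-precision counters approximate; the function name and interface are our own, for illustration:

import math
from collections import Counter

def mutual_information(x, y):
    # Plug-in estimate of I(X;Y) in bits from two paired sequences
    # of discrete values, using empirical occurrence counts.
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), n_ab in cxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log2(n_ab * n / (cx[a] * cy[b]))
    return mi

print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 bit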
Mutual information parameters are typically represented in the logarithm domain. For the reduced-precision parameters, we compute the number of occurrences and use a lookup table to determine the logarithm of the probability of a particular event. The lookup table is indexed by the number of occurrences of an event and the total number of events, and stores the logarithms in the desired reduced-precision representation. In keeping with the fixed-point representation, and to bound the maximum size of the lookup table and the bit-width required for the counters, we assume a maximum integer value. After computing the cumulative count, and in order to guarantee that the counts stay in range, the algorithm identifies counters that reach this maximum value and halves them. A minimal sketch of this scheme follows.
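The sketch below illustrates the two ingredients under assumed parameters (4 fractional bits, 8-bit counters); in hardware, rp_log2 would be a table read rather than a runtime computation:

import math

FRAC_BITS = 4      # assumed reduced-precision fractional bits
MAX_COUNT = 255    # assumed maximum value of an 8-bit counter

def rp_log2(k: int, n: int) -> float:
    # Logarithm of the probability of an event seen k times out of n,
    # rounded to FRAC_BITS fractional bits; indexed by (k, n), these
    # values form the lookup table.
    scale = 1 << FRAC_BITS
    return round(math.log2(k / n) * scale) / scale

def increment(counts: dict, event) -> None:
    # Increment an event counter; a counter that reaches its maximum
    # value is halved so that the counts stay in range.
    counts[event] = counts.get(event, 0) + 1
    if counts[event] >= MAX_COUNT:
        counts[event] //= 2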
3. Experimental Results and Conclusions
Our limited bit depth mutual information can be applied to any method that internally uses the mutual information measure. We have chosen feature selection because, with the advent of Big Data, the feature selection process plays a key role in reducing the high dimensionality of machine learning problems. A large number of feature selection methods use mutual information as a measure, so their performance depends on the accuracy of the mutual information step. Among the feature selection algorithms based on mutual information, we use the mRMR (minimum Redundancy Maximum Relevance) multivariate filter [4], owing to its popularity and good results in the machine learning area; a greedy sketch of it is given below.
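The following sketch captures the greedy mRMR selection rule; the interface (features as a list of discrete-valued columns and a pluggable mi estimator) is our own illustration, not the original implementation:

def mrmr(features, target, k, mi):
    # Greedily select k features, each time maximizing relevance to the
    # target minus mean redundancy with the already selected features.
    # `mi` is any mutual information estimator, e.g. a limited bit
    # depth one.
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = mi(features[f], target)
            redundancy = (sum(mi(features[f], features[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected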
Experimental results on several synthetic and real datasets show that 16 bits are sufficient to return the same feature ranking as the double-precision representation. Moreover, classification results show that even with a 4-bit representation, our limited bit depth mutual information achieves performances very close to those of full-precision mutual information. Implementing mutual information this way in embedded systems therefore yields meaningful computational, runtime, and memory benefits.