The continuous growth of information technologies and services has made it possible to store, process, and transmit data of many different types. This results in large and growing sources of data about natural, industrial, and business processes, as well as scientific, social, and political activities, among many others. Furthermore, the collected data will keep growing at an ever-expanding rate with the rollout of disruptive technologies such as the Internet of Things (IoT), in which a plethora of sensing devices collect and transmit data from their environments to the Internet, or Big Data, where huge amounts of data must be manipulated by several tasks at different stages with diverse requirements. In particular, it is forecasted that there will be around 40 billion connected IoT devices generating about 80 ZB of data in 2025 [1]. The large volumes of data stored, for instance, in central cloud entities on the Internet turn out to be of value only if meaningful information can be extracted from them to produce value-added services and to support knowledge-based decision-making systems. In this regard, a remarkable interest in data-mining methods has arisen in recent decades, the goal being to examine datasets with diverse properties in order to efficiently extract patterns describing the behavior of a phenomenon under study [2,3,4]. Such data-mining solutions face several challenges, mainly because of the particular characteristics of modern real-world datasets, which can be very large in volume and often mix numerical and non-numerical data types. To gain insight into large datasets, data-mining solutions rely on computational models and methods that mimic the ways in which humans make decisions based on patterns and inferences derived from the analyzed data [5,6,7]. These methods are known as Machine Learning (ML), which encompasses a variety of scientific disciplines such as Statistics [8], Learning Theory [9], and Computer Science [10]. Due to the huge amount of available data, an important task in ML is to select an appropriate dataset that represents the experience and knowledge about the problem to be learned and solved by a computer system. This task involves determining the d most relevant features of the problem (feature selection [11,12]) and defining an appropriate sample of n d-tuples containing values for those selected features [13,14]. Since computers work on a numerical representation of operations and data, most ML algorithms require that these tuples be encoded as sequences of numerical values. In this context, a d-tuple is represented as a vector of the form $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, where $x_i$ corresponds to the value of the ith feature. Formally speaking, a dataset $X$ is generally composed of a set of N d-tuples. In many situations, though, even when a feature $x_i$ is numerically valued, its value cannot be considered strictly a number because it represents an ordered or unordered category. When this occurs, $x_i$ is said to be a categorical feature. A dataset containing d-tuples with both categorical and numerical features is commonly called a mixed-type dataset.
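As an illustration (with hypothetical feature names, not taken from any particular dataset used in this work), a single mixed-type d-tuple might look as follows; most ML algorithms expect a purely numerical vector, so the categorical values must first be mapped to numbers in some meaningful way:

```python
# A hypothetical mixed-type d-tuple (d = 4): two numerical and two categorical features.
record = {
    "age": 34,              # numerical
    "income": 52000.0,      # numerical
    "occupation": "nurse",  # categorical, unordered
    "education": "MSc",     # categorical, ordered
}
```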
1.1. Related Work
When a dataset consists only of categorical features, several approaches attempt to extend the notions of distance, centrality, and dispersion to non-numerical features relying on their intrinsic constraints. For example, in clustering problems, the K-modes algorithm [15,16,17,18] uses a dissimilarity measure to deal with the distance between non-numerical d-tuples, replacing means with modes as the centrality measure, and using a frequency-based method to iteratively update the cluster centers so as to minimize a cost function in a fashion similar to K-means. ROCK [19] is another approach that computes distances using the Jaccard coefficient [20], from which, using a threshold parameter, it is possible to determine the neighbors of any non-numerical d-tuple. COOLCAT [21] does not rely on an arbitrary notion of distance and is instead based on the notion of entropy; it aims to find the clusters that minimize the overall expected entropy.
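As a rough illustration of the kind of dissimilarity and centrality used by K-modes-like algorithms (a minimal sketch written for this exposition, not the reference implementation of [15,16,17,18]), mismatch counting replaces Euclidean distance and the column-wise mode replaces the mean:

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of positions in which two categorical d-tuples disagree."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def cluster_mode(tuples):
    """Column-wise mode: the 'center' used by K-modes in place of the mean."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*tuples))

# Example usage with hypothetical categorical d-tuples.
cluster = [("red", "S", "cat"), ("red", "M", "dog"), ("blue", "S", "cat")]
center = cluster_mode(cluster)                           # ('red', 'S', 'cat')
d = matching_dissimilarity(("red", "L", "dog"), center)  # 2 mismatches
```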
When the dataset is made up of mixed features, numerical and non-numerical, preprocessing tasks are typically carried out to make the non-numerical features amenable to numerical analysis. For example, in regression models [22], neural networks [23], or cluster analysis [24], such preprocessing consists of encoding categorical features into a set of artificial numerical values (dummy variables) that take binary values to indicate the presence or absence of some category [25]. Specifically, categorical features are encoded using a one-hot encoding scheme, which creates a binary column for each category, as shown in Table 1. Other methods, such as deviation coding, orthogonal polynomial coding, and Helmert coding [26], use more values than just zero and one.
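As an illustration of the one-hot scheme (a minimal sketch using the pandas library and a hypothetical "color" feature, not tied to any experiment in this work):

```python
import pandas as pd

# Hypothetical mixed-type dataset: one numerical and one categorical feature.
df = pd.DataFrame({"weight": [3.2, 1.8, 2.5], "color": ["red", "blue", "red"]})

# One-hot encoding: one binary column per category of 'color'
# (color_blue, color_red), each row being non-zero in exactly one of them.
encoded = pd.get_dummies(df, columns=["color"])
```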
An important concern about these methods is the number of dummy variables. A categorical feature with w different values is represented with up to w dummy variables [27]. In this sense, large values of w might produce an overabundance of dummy variables, which in turn might pose performance issues (from a computational viewpoint) and, worse yet, side effects such as multicollinearity [28,29]. To deal with this, methods such as feature hashing are widely used [30]. In this case, the original feature values are mapped to a smaller set of integer hash values which, in turn, are converted into new feature values via modular arithmetic. However, some of the original values can be mapped to the same hash value (a collision), producing noisy data that can lead to inconsistent results.
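A minimal sketch of the feature-hashing idea just described (written from scratch for this exposition, not the implementation of [30]): each categorical value is hashed to an integer and reduced modulo a fixed number of buckets, so many distinct categories are squeezed into a small numerical range and collisions become possible.

```python
import hashlib

def hash_feature(value, n_buckets=8):
    """Map a categorical value to one of n_buckets indices via hashing + modular arithmetic."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Two different cities may land in the same bucket (a collision).
for city in ["Paris", "Lima", "Osaka", "Quito"]:
    print(city, "->", hash_feature(city))
```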
Other methods encode non-numerical values into numerical ones based on the outcome data (supervised problems); these have proven suitable when the non-numerical feature has many possible values, or even when new values appear after the training stage [31,32,33]. In scenarios where the outcome is unknown (unsupervised learning problems), other approaches have arisen.
For instance, in [34], the authors describe a method to cluster objects represented by mixed features using a similarity measure first proposed for genetic taxonomy problems; the main idea is to cover both types of data within the same framework, giving greater weight to matches on uncommon feature values. In [35], the authors present variants of the k-means algorithm that remove the restriction of processing only numerical features. In [36], the k-prototypes algorithm is presented, where the similarity between objects is based on a function of both numerical and non-numerical attributes. In this regard, a two-term distance function is proposed: the first term corresponds to the usual squared Euclidean distance, whereas the second is a weighted similarity measure on categorical attributes based on one hyper-parameter and the number of mismatches between an object and a cluster prototype. The hyper-parameter gives complete control over how much the non-numerical attributes are taken into account in the final clustering; with a value of zero, the algorithm behaves exactly as k-means without considering the non-numerical features.
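A minimal sketch of a two-term distance of this kind (the notation and the parameter name gamma are ours for illustration, not necessarily those of [36]):

```python
def two_term_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on the numerical part plus gamma times
    the number of categorical mismatches against a cluster prototype."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    mismatches = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * mismatches

# gamma = 0 reduces the distance to the purely numerical k-means term.
d = two_term_distance([1.0, 2.0], ["red", "S"], [0.5, 2.5], ["blue", "S"], gamma=0.5)
```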
A similar approach is presented in [37], but the authors base their algorithm on the concept of evidence accumulation. The idea of evidence accumulation is to combine the results of multiple clusterings into a single data partition by taking the co-occurrences of pattern pairs (with non-numerical values) in the same cluster as votes for that association. The data partitions are mapped into a similarity matrix of patterns, and then a spectral clustering method is applied. In light of the above ideas, some knowledge discovery systems have started to incorporate algorithms that deal with mixed types as part of database mining and knowledge discovery systems [38]. Another interesting idea, reported in [39], deals with the inclusion of non-numerical variables in Self-Organizing Maps (SOM). In the original SOM, proposed by Kohonen, non-numerical data cannot be handled directly due to the lack of a direct representation and of a scheme for computing distances. Thus, an improvement of the ability of SOM to process mixed data is reported, whose main idea is to build a hierarchy graph of the non-numerical values and then measure the distances between objects according to their distances in the hierarchy.
Given that some ML methods are based on similarity (or dissimilarity) measures between pairs of data objects, Gower [40,41] proposed a distance measure for data with mixed variables, including numeric, binary, categorical, and ordinal ones. Given two data objects $x_i$ and $x_j$, Gower's distance is defined as $d_{ij} = \sum_{k=1}^{d} w_{ijk}\, d_{ijk} \big/ \sum_{k=1}^{d} w_{ijk}$, where $d_{ijk}$ is the contribution of the kth variable to the dissimilarity between $x_i$ and $x_j$, and is computed according to the type of the variable. For binary or categorical (nominal) variables, $d_{ijk} = 1$ if $x_{ik} \neq x_{jk}$ and 0 if $x_{ik} = x_{jk}$. Numeric variables are considered to be interval-scaled variables, and $d_{ijk} = |x_{ik} - x_{jk}| / R_k$, where $R_k$ is the range of the kth variable. Ordinal (or categorical ordinal) variables are treated as numeric variables after being transformed by $z_{ik} = (r_{ik} - 1)/(M_k - 1)$, where $r_{ik}$ is the position rank index and $M_k$ is the highest rank for variable k. In its standard form, the weights $w_{ijk}$ of Gower's distance are set to 1 if both measurements $x_{ik}$ and $x_{jk}$ are non-missing, meaning that both objects can be compared on variable k; otherwise, the comparison is not possible and $w_{ijk} = 0$. However, those weights can take different values related to the importance of variable k, or can be regarded as a function of the values of the variables being compared [40].
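A compact sketch of this computation for a dataset with one numeric and one nominal variable (our own illustration, assuming no missing values so that all weights equal 1):

```python
import numpy as np

def gower_distance(xi, xj, kinds, ranges):
    """Gower dissimilarity between two objects; kinds[k] is 'num' or 'cat',
    ranges[k] is the range R_k of numeric variable k (ignored for 'cat')."""
    contributions = []
    for k, kind in enumerate(kinds):
        if kind == "num":
            contributions.append(abs(xi[k] - xj[k]) / ranges[k])
        else:  # nominal variable: 1 on mismatch, 0 on match
            contributions.append(0.0 if xi[k] == xj[k] else 1.0)
    return np.mean(contributions)  # all weights w_ijk = 1

# Example with hypothetical objects (age, color); R_age computed from the data.
data = [(25, "red"), (40, "blue"), (33, "red")]
r_age = max(x[0] for x in data) - min(x[0] for x in data)
d01 = gower_distance(data[0], data[1], kinds=("num", "cat"), ranges=(r_age, None))
```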
By applying Gower's distance formula to all pairs of n observations (data objects), we get an $n \times n$ symmetric, positive semi-definite dissimilarity matrix $D$, which can be used as input to distance-matrix-based unsupervised ML algorithms, such as hierarchical clustering. However, for supervised and unsupervised ML algorithms that operate on data features, it is necessary to find some mapping from the dissimilarities (which encode the mix of variables) to a representation amenable to those algorithms. The traditional approach is to use Multidimensional Scaling (MDS) [42], a family of algorithms that finds representations of the distances among points (data objects) in a low-dimensional space, starting from the dissimilarity matrix and optimizing a convenient stress function [43,44,45] in such a way that the distances between points in the resulting configuration correspond closely to the dissimilarities given in $D$. In its simplest form, classical MDS is equivalent to applying Principal Component Analysis (PCA) to the dissimilarity matrix, and the mapping is given by the first p principal components.
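A brief sketch of classical MDS on a precomputed dissimilarity matrix (our own illustration of the standard Torgerson construction, not code taken from [42]):

```python
import numpy as np

def classical_mds(D, p=2):
    """Embed objects in p dimensions from an n x n dissimilarity matrix D
    (classical MDS: double-center the squared dissimilarities, then keep the
    top-p eigenvectors scaled by the square roots of their eigenvalues)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram-like matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:p]   # indices of the p largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[idx], 0, None))
    return eigvecs[:, idx] * scale        # n x p configuration

# Usage: X_low = classical_mds(D, p=2), where D could come from pairwise Gower distances.
```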
Similar to Gower, Gibert et al. [46] proposed distance metrics for non-homogeneous data which take into account quantitative (numeric) and qualitative (categorical) variables, indexed by the sets $\mathcal{C}$ and $\mathcal{Q}$, respectively. In this case, the distance between data objects $x_i$ and $x_j$ is given by the so-called mixed function $d^2(x_i, x_j) = \alpha\, d_{\mathcal{C}}^2(x_i, x_j) + \beta\, d_{\mathcal{Q}}^2(x_i, x_j)$, where the distance for quantitative variables is defined by $d_{\mathcal{C}}^2(x_i, x_j) = \sum_{k \in \mathcal{C}} (x_{ik} - x_{jk})^2 / s_k^2$, which is the Euclidean distance normalized by the variance $s_k^2$, and, for qualitative variables, the distance is defined by $d_{\mathcal{Q}}^2(x_i, x_j)$, a distance in which a symbolic representation of the categorical values is used (see [46] for details). The weights $\alpha$ and $\beta$ represent the influence given to each group of variables, and some heuristics for choosing these values are proposed in [46]. An extensive set of experiments on clustering synthetic and real datasets was presented in [46,47], where a hierarchical clustering algorithm is used with the distance metric proposed by Gibert et al., showing good properties compared with other metrics.
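The following sketch shows the general shape of such a weighted two-block mixed distance; it is only illustrative, and plain simple matching stands in here for the symbolic qualitative distance actually defined in [46]:

```python
def mixed_distance_sq(xi, xj, num_idx, cat_idx, variances, alpha=0.5, beta=0.5):
    """Weighted combination of a variance-normalized squared Euclidean term
    (quantitative variables) and a qualitative term (simple matching is used
    here as a stand-in for the symbolic distance of [46])."""
    num = sum((xi[k] - xj[k]) ** 2 / variances[k] for k in num_idx)
    cat = sum(1.0 for k in cat_idx if xi[k] != xj[k])
    return alpha * num + beta * cat

# Hypothetical usage: variables 0-1 quantitative, variable 2 qualitative.
d2 = mixed_distance_sq((1.2, 3.4, "A"), (0.8, 2.9, "B"),
                       num_idx=(0, 1), cat_idx=(2,),
                       variances={0: 0.5, 1: 1.2}, alpha=0.7, beta=0.3)
```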
We have mentioned several methods that map from the categorical space to a numerical space in which arithmetic operations and distance measures are meaningful. However, these methods may not be as efficient, in terms of computational cost and data consistency, as one might suppose. For instance, one-hot encoding has the drawback that it increases the dimensionality of the dataset in proportion to the number of distinct categorical values. Methods such as feature hashing can yield representations of much lower dimensionality than one-hot encoding, although two or more categories may collide by mapping into the same hash value, inducing a possible inconsistency in the data. In the case of the Gower and Gibert distances, although they can be very useful when combined with MDS, they are limited to small or medium-size datasets, because the solution is given in terms of the number of observations. These issues become critical, in terms of memory and efficiency, for huge datasets (with hundreds of categorical variables taking hundreds of values), such as those found in the social sciences and e-commerce, among others, because the patterns that we can find in low-dimensional spaces are lost in higher dimensions (the curse of dimensionality). In view of these drawbacks, we propose a method that maps the original data (including numerical and categorical features) into a new information-theoretic space in which all data are amenable to numerical analysis.