2.1. About Attributes, Patterns, Datasets, Validation, Performance, and Main Approaches to Pattern Classification
When working in pattern classification, the first step is to select and understand the small portion of the universe that one wishes to analyze. Once this selection has been made, it is important to identify the useful attributes (also called features) that describe the phenomenon under study.
The raw material for work in pattern recognition and related disciplines is the data, which consist of specific values of the attributes. For example, the attribute “number of children” could correspond to specific data, or values, such as: 0, 1, 3, 12, 20, while the values 1.78, 1.56, 2.08, measured in meters, could be data from an attribute called “height”. These two attributes are numerical. On the other hand, there are also categorical attributes and missing values. The data associated with the “sex” attribute could be F and M, and the “temperature” attribute could have the following data as specific values: hot, warm, cold.
The next step is to convert those features into a representation that the computer can understand and that is, at the same time, convenient to handle. The most common and simplest choice is a vector composed of numbers that represent real-valued features. For instance, the color of a leaf can be represented by numbers that indicate the intensity of the color, size, morphology, smell, among others. For categorical or mixed attributes, it is possible to use arrays of values. Such representing arrays are known as patterns.
Patterns with common attributes form classes, and these sets representing various classes of a phenomenon under study, when joined, form datasets. For example, the vector [5.4, 3.7, 1.5, 0.2] represents a flower, and the four attributes (in cm) correspond to the sepal length, sepal width, petal length, and petal width, in that order. This flower is a pattern that belongs to one of the three classes (Iris Setosa) contained in the famous Iris Dataset (
https://archive.ics.uci.edu/ml/datasets/iris).
Given a phenomenon under study, it is of vital importance to select properly the attributes and data, to form the patterns that correspond to each of the classes. Typically, the choice of attributes (how many and of what type), the formation of patterns and the design of datasets are all creative processes where the active participation of experts in the specific area of application is required. Experts in the areas of application (for example, physicians to classify diseases, or economists to classify financial risks) join with experts in pattern recognition, machine learning, artificial intelligence, and data mining to create useful datasets to solve problems in different application areas [
1,
2,
3,
4].
There are many pattern classifiers that only work with numerical attributes. In these cases, preprocessing has to be applied to the data in order to convert the categorical data to numerical ones and to impute the missing values [
14]. After preprocessing, there will only be numerical values.
The dimension of the patterns can range from 3 or 4 attributes up to hundreds, thousands, or even millions, as happens with the patterns of datasets that contain bioinformatics data. For their part, datasets vary in size and can have millions of patterns in the field of big data [
15].
A very important aspect related to classes is balance or imbalance. The ideal case is that in a dataset, all classes have the same size. However, the most interesting datasets are far from the ideal case. For example, in the datasets used for the classification of diseases, typically the class “sick” is the minority class (which is desirable). From the computational point of view, the imbalance has consequences related to the way of measuring the performance of the classifiers [
1].
Given a dataset, the imbalance is measured using the imbalance ratio (IR), which is calculated as follows [
16]:
$$ IR = \frac{N_{maj}}{N_{min}} \qquad (1) $$
where $N_{maj}$ is the number of patterns in the majority class and $N_{min}$ is the number of patterns in the minority class. A dataset is considered imbalanced if it happens that $IR > 1.5$.
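As a quick illustration, the following minimal Python sketch (not taken from the paper) computes the IR of Equation (1) for hypothetical class labels and applies the 1.5 threshold:

```python
# Minimal sketch: imbalance ratio (IR) of Equation (1) for a two-class dataset,
# computed as majority-class size / minority-class size. The labels are hypothetical.
from collections import Counter

labels = ["healthy"] * 95 + ["sick"] * 5
counts = Counter(labels)
ir = max(counts.values()) / min(counts.values())
print(f"IR = {ir:.1f}, imbalanced: {ir > 1.5}")   # IR = 19.0, imbalanced: True
```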
There exist dataset repositories that have become important auxiliaries to those who develop algorithms and models in pattern recognition and related disciplines. Notable examples are the repositories powered by prestigious universities, of which two stand out. One of them is the repository
http://archive.ics.uci.edu/ml/datasets.php, which was made available to research groups by the University of California Irvine, USA. The other repository is sponsored by the University of Granada in Spain:
http://sci2s.ugr.es/keel/datasets.php. Recently, a dataset repository that runs contests has become very popular, where researchers test their pattern recognition algorithms on real-life datasets, with the chance to win cash prizes:
https://www.kaggle.com/datasets.
The researchers in pattern recognition have certain software platforms at their disposal, where they can develop and test pattern classification algorithms. One of the most useful and famous platforms was developed by scientists from Waikato University in New Zealand:
www.cs.waikato.ac.nz/ml/weka/. WEKA is an analysis environment written in Java. This platform contains a vast collection of algorithms for data analysis and predictive modeling, including methods that address problems of regression, classification, clustering, and feature selection. The software allows a dataset to be pre-processed, if required, and inserted into a learning scheme, in order to analyze the results and the performance of a selected classifier.
In the supervised paradigm, a classifier consists of two phases: learning and classification [
17]. The classifier requires a training set to perform the learning. Once trained, the classifier runs the classification phase, where a class is assigned to each testing pattern. A common practice is that, given a dataset D, a partition is made into two subsets: L (learning or training) and T (test). The classifier learns with L, and the patterns in T are used for testing (
Figure 1).
Since it is a partition, the following must be true: $L \cup T = D$ and $L \cap T = \emptyset$.
The partition is obtained through a validation method, among which the leave-one-out method stands out as one of the most used in the world of pattern recognition research [
18]. As illustrated in
Figure 2, in each iteration, a single pattern is left out for testing (it constitutes T), and the model learns with the patterns in L, which are precisely the rest of the patterns in D.
Leave-one-out is a particular case of a more general cross-validation method called stratified k-fold cross-validation. It consists of partitioning the dataset into k folds, where k is a positive integer (k = 5 and k = 10 are the most popular values in the current specialized literature), ensuring that all classes are represented proportionally in each fold [
19].
The operation of the k-fold cross-validation is very similar to the schematic diagram in
Figure 2, except that instead of a pattern, one of the k folds is taken. Note that leave-one-out is a particular case of k-fold cross-validation with k = N, where
N is the total number of patterns in the dataset.
In the experimental section of this article, we used the 5-fold cross-validation procedure. The reason is that some leading authors recommend this validation method especially for imbalanced datasets [
20].
Figure 3 and
Figure 4 show schematic diagrams of the 5-fold stratified cross-validation method for a 3-class dataset.
For datasets that return a value much greater than 1.5 when applying Equation (1), it is recommended to use a variant of cross-validation [
21]. This is the 5 × 2 cross-validation method, which consists of partitioning the dataset into 2 stratified folds and performing 5 iterations. The process is illustrated in
Figure 5.
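For readers who wish to reproduce these partitions outside WEKA, the following sketch shows how the stratified 5-fold and 5 × 2 splits can be generated with scikit-learn (an assumed toolchain, not the one used in the paper; the data are hypothetical):

```python
# Illustrative sketch: stratified 5-fold cross-validation and the 5 x 2 variant
# for strongly imbalanced data, using scikit-learn on hypothetical patterns.
import numpy as np
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold

X = np.random.rand(100, 4)            # hypothetical patterns with 4 attributes
y = np.array([0] * 95 + [1] * 5)      # imbalanced classes (IR = 19)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (learn_idx, test_idx) in enumerate(skf.split(X, y)):
    # each testing fold keeps the 95:5 class proportion
    print(fold, np.bincount(y[test_idx]))

# 5 x 2 cross-validation: 2 stratified folds, repeated 5 times
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
```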
After applying a validation method to the dataset under study, the researcher has at their disposal the partition of dataset D into the sets L and T. Now a relevant question arises: how is the performance of a classification algorithm measured?
Common sense indicates that calculating accuracy is a good way to decide how good a classifier is. Accuracy is the ratio of the number of patterns classified correctly to the total number of patterns contained in the testing set T, expressed as a percentage. If N is the total number of patterns in the testing set T, and C is the number of patterns correctly classified, the value of accuracy is calculated using Equation (2):
$$ accuracy = \frac{C}{N} \times 100 \qquad (2) $$
Obviously, it is true that $0 \le C \le N$, so that $0 \le accuracy \le 100$.
For example, consider a study where dataset D is balanced and the testing set T contains patterns of 100 patients, 48 healthy and 52 sick. If an A1 classification algorithm learns with L and when testing it with T correctly classifies 95 of the patients, the accuracy of A1 in D is said to be 95%. In general, values above 90% for performance are considered to indicate that it is a good classifier. Accuracy is a very easy measure of performance for classifiers, but it has a huge disadvantage: it is only useful for datasets that give a value less than 1.5 when applying Equation (1).
Now, we will illustrate with a hypothetical example what might happen when using accuracy as a performance measure on a severely imbalanced dataset. To do this, consider again the study of the previous example, but now with D severely imbalanced (IR much greater than 1.5). When applying a stratified validation method, the testing set T also consists of 100 patients, with the difference that there are now 95 healthy and 5 sick.
To describe the hypothetical example, we are going to invent a very bad A2 classification algorithm. This A2 algorithm consists of assigning the “healthy” class to any testing pattern, whatever it may be. This classifier is very bad because the decision made is totally arbitrary, without even considering the values of the attributes in the patterns. When A2 is tested with the new T, the number of patterns classified “correctly” is 95, as in the previous example, and therefore the accuracy value is 95%, although we know that the A2 classifier is a fake one.
This example illustrates that using accuracy as a measure of classifier performance on an imbalanced dataset can privilege the majority class. What is surprising is that the behavior of the A2 classifier is replicated in many of the state-of-the-art classifiers. That is, many of the classifiers used by researchers in pattern recognition override the minority class in imbalanced datasets when accuracy is used as a performance measure. Typically, in pattern recognition, the jargon of the medical sciences is applied, and the minority class of an imbalanced dataset is called the “positive” class, while the majority class is the “negative” class.
The solution to the problem previously described is the use of the confusion matrix [
22]. Without loss of generality, a schematic diagram of a confusion matrix for two classes (positive and negative) is included in
Figure 6.
where:
TP is the number of correct predictions that a pattern is positive.
FN is the number of incorrect predictions that a positive pattern is negative.
TN is the number of correct predictions that a pattern is negative.
FP is the number of incorrect predictions that a negative pattern is positive.
If we note that the total number of patterns correctly classified is TP + TN and that the total number of patterns in the set T is TP + TN + FP + FN, Equation (3) expresses Equation (2) in terms of the elements of the confusion matrix:
$$ accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (3) $$
A possible confusion matrix with which an accuracy value of 95% is obtained in the first example is shown in
Figure 7. There we can see that of the 48 “sick” (positive class), 47 were correctly classified as “sick” (TP), and only one of the “sick” was incorrectly classified (FN) as healthy (negative class). Furthermore, of the 52 “healthy” (negative class), 48 were classified correctly as “healthy” (TN), and 4 of them were incorrectly (FP) classified as sick (positive class).
According to Equation (3), the 95% value for accuracy is obtained by dividing the sum of the correctly classified patterns (TP + TN = 47 + 48 = 95) by the total number of patterns in the set T (TP + TN + FP + FN = 47 + 48 + 4 + 1 = 100), and multiplying the result by 100.
For the hypothetical second example, the confusion matrix is illustrated in
Figure 8. The A2 classification algorithm classifies all the patterns in T as if they were of the negative class (“healthy”). Indeed, the accuracy value appears to be good (95%), but in reality it is a useless result because the “classifier” A2 was not able to detect any pattern of the positive class (“sick”).
Of course, it is difficult to find in the specialized literature any classification algorithm as bad as the A2 algorithm. The possibility of any state-of-the-art classifier giving zero as a result in FP or FN is practically null. Most classifiers behave “well” against datasets with severe imbalances. The purpose of any state-of-the-art classifier is, then, to minimize the values of FP, FN, or the sum of FP and FN, or to optimize some performance measure derived from the confusion matrix.
Here is precisely the alternative to the problems caused by using accuracy as a performance measure on imbalanced datasets. The alternative is to define new performance measures that are derived from the confusion matrix.
There are a very large number of different performance measures that are derived from the confusion matrix [
23]. However, here we will only mention the definitions of three: sensitivity, specificity and balanced accuracy, because these three performance measures will be used in the experimental section of this document [
22].
In the confusion matrix of
Figure 6, it can be easily seen that the total of patterns of the positive class is obtained with this sum: TP + FN. Also, it is evident that the total of patterns of the negative class is obtained with this sum: TN + FP.
The fraction of the number of positive patterns correctly classified with respect to the total number of positive patterns in T is called sensitivity:
$$ sensitivity = \frac{TP}{TP + FN} \qquad (4) $$
As the value of TP approaches the total of correctly classified positive patterns, FN tends to zero, and the value of sensitivity tends to 1. The ideal case is when FN = 0, so sensitivity = 1.
The undesirable extreme case is when TP = 0 and this means that the classifier failed to detect any positive pattern, as in
Figure 8. In this case, sensitivity = 0.
The fraction of the number of negative patterns correctly classified with respect to the total number of negative patterns in T is called specificity:
$$ specificity = \frac{TN}{TN + FP} \qquad (5) $$
As the value of TN approaches the total of correctly classified negative patterns, FP tends to zero, and the value of specificity tends to 1. The ideal case is when FP = 0, so specificity = 1.
The undesirable extreme case is when TN = 0 and this means that the classifier failed to detect any negative pattern. In this case, specificity = 0.
It is not difficult to imagine that there is some parallelism between accuracy, sensitivity, and specificity, because these last two measures can be thought of as a local accuracy, by class. Sensitivity is a kind of accuracy for the positive class, while specificity is a kind of accuracy for the negative class. Therefore, it should not be strange that, based on both measures, sensitivity and specificity, a performance measure is defined for imbalanced datasets which takes both classes into account separately. This is balanced accuracy (BA), a measure that is defined as the average of sensitivity and specificity:
$$ BA = \frac{sensitivity + specificity}{2} \qquad (6) $$
From the values of the confusion matrix in
Figure 7, we can calculate the performance measures of Equations (4)–(6). Sensitivity = 0.98, specificity = 0.92, and BA = 0.95, a value close to the ideal case. On the other hand, when performing the same calculations with the data from the confusion matrix in
Figure 8, we obtain the following results: sensitivity = 0, specificity = 1, and finally BA = 0.5. Note how BA punishes the value 0 in the sensitivity that was obtained with the “classifier” A2.
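The following short Python sketch gathers Equations (2)–(6) and evaluates them on the two hypothetical confusion matrices of Figures 7 and 8, reproducing the values just discussed:

```python
# Sketch: performance measures of Equations (2)-(6) computed from the two
# hypothetical confusion matrices of Figures 7 and 8 (values taken from the text).
def metrics(tp, fn, tn, fp):
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)   # Equation (3)
    sensitivity = tp / (tp + fn)                       # Equation (4)
    specificity = tn / (tn + fp)                       # Equation (5)
    ba = (sensitivity + specificity) / 2               # Equation (6)
    return accuracy, sensitivity, specificity, ba

print(metrics(tp=47, fn=1, tn=48, fp=4))   # A1: (95.0, ~0.98, ~0.92, ~0.95)
print(metrics(tp=0, fn=5, tn=95, fp=0))    # A2: (95.0, 0.0, 1.0, 0.5)
```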
So far we have talked about classification algorithms without specifying how these algorithms perform the classification of patterns in the context of pattern recognition. The time has come to comment on the theoretical foundations on which classifiers rest, and the different approaches to pattern classification derived from these scientific concepts.
The most important state-of-the-art approaches will be mentioned in this brief summary of pattern classification algorithms. For each approach, the philosophy of the approach will be described very concisely, in addition to the scientific ideas on which it rests. In addition, one of the algorithms representing that approach will be mentioned, which will be tested for comparison in the experimental section of this paper.
It must be emphasized that all the pattern classification algorithms against which our proposal will be compared in the experimental section of this paper are part of the state of the art [
24]. Additionally, it is necessary to point out something important that shows the relevance and validity of the pattern classification algorithms used in this paper: each and every one of these algorithms is included in the WEKA platform. This is relevant because WEKA has become an indispensable auxiliary, at the world level, for researchers dedicated to pattern recognition [
25].
In the state of the art, it is possible to find a great variety of conceptual bases that give theoretical support to the task of intelligent classification of patterns. One of the most important and well-known theoretical bases is the theory of probability and statistics, which gives rise to the probabilistic-statistical approach to pattern classification. The Bayes theorem is the cornerstone on which this approach to minimize classification errors rests, hence the classifiers are called Bayes classifiers [
26]. It is an open research topic, and researchers continue to search for new modalities that improve the algorithms of the Bayes classifiers [27].
Metrics and their properties cannot be missing as the conceptual basis of one of the most important approaches of pattern recognition. In 1967, scientific research into pattern classification was greatly enriched when Cover and Hart created the NN (nearest neighbor) classifier [
28]. The idea is so simple that, at first glance, it seems pointless to take it into account. Given a testing pattern, the NN rule assigns it the class of the pattern that is closest in the attribute space, according to a specified metric (or dissimilarity function). However, the authors of the NN classifier provided solid theoretical support for it. Furthermore, the large number and quality of applications throughout these years show the effectiveness of the NN classifier and its continuing validity as a research topic [
29]. The extension of the NN model led to the creation of k-NN, where the NN rule is generalized to the k nearest neighbors. In the k-NN model, the class is assigned to the test pattern by majority vote among its k closest patterns in the learning set [
30]. Nowadays, k-NN classifiers are considered among the most important and useful approaches in pattern classification [
31].
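As an illustration of the NN/k-NN rule described above, here is a minimal sketch using scikit-learn (an assumed library choice, not the WEKA implementation employed in this paper) on the Iris dataset:

```python
# Minimal sketch of the k-NN rule: the class of a testing pattern is decided
# by majority vote among its k nearest neighbors in the learning set L.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_learn, y_learn)
print(knn.score(X_test, y_test))   # accuracy on the testing set T
```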
On the other hand, tree-based classifiers use the advanced theory of graphs and trees to try to keep the number of errors to a minimum when solving a classification problem [
32]. In a decision tree, each of the internal nodes represents an attribute, and final nodes (leaves) constitute the classification result. All nodes are connected by branches representing simple if-then-else rules which are inferred from the data [
33]. Decision trees are simple to understand because they can be depicted visually, the data require little or no preparation at all, and they are able to handle both numerical and categorical data. The study of decision trees is a current topic of scientific research [
34].
It is common for the logistic function to be present in the experimental section of many pattern classification articles. The specialized literature speaks of the logistic regression classifier because the logistic function was originally designed to perform the regression task; however, as the name implies, the algorithm is a classifier [
35]. This function is sigmoid (due to the shape of its graph), and its expression involves the exponential function whose base is Euler's number e [
36]. The input of the logistic function is a real number, and the output is a real value in the open interval (0, 1). Therefore, the logistic regression function is useful for classifying patterns in two-class datasets [
37].
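For reference, the logistic (sigmoid) function described in the previous paragraph is usually written as:
$$ \sigma(t) = \frac{1}{1 + e^{-t}}, \qquad t \in \mathbb{R}, \qquad \sigma(t) \in (0, 1) $$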
Before the scientific revolution generated by the arrival of deep learning in 2012, the “enemy to be defeated” in comparative studies of pattern classifiers was a model known as support vector machines (SVM). This model originates from the famous statistical learning theory [
38], and was unveiled in a famous article published a quarter of a century ago [
39]. The optimization of analytical functions serves as a theoretical basis in the design and operation of SVM models, which attempts to find the maximum margin hyperplane to separate the classes in the attribute space. Although it is true that deep learning-based models obtain impressive results in patterns represented by digital images, it is also a fact that SVMs continue to occupy the first places in performance in classification problems where patterns are not digital images [
40].
The “scientific revolution generated by the arrival of Deep Learning in 2012” mentioned in the previous paragraph actually began many years earlier, in 1943, when McCulloch and Pitts presented the first mathematical model of a human neuron to the world [
41]. With this model began the study of neural classifiers, which gained great momentum in 1985–1986. In those years, several versions of an algorithm that allows neural networks to be trained successfully were published [
42,
43]. This algorithm is called backpropagation, and two of the most important pioneers of the “Deep Learning era” participated in its creation: LeCun and Hinton [
44]. Among the most relevant neural models with applications in many areas of human activity, the multi-layer perceptron (MLP) is one of those rated as excellent by the scientific community [
45]. Therefore we have included it in the comparative study presented in the experimental section of this paper.
The last classifier that we have selected to be included in the experimental section of this paper does not belong to a different approach from those previously described. Rather, it is a set of classifiers from some of the approaches mentioned, which are grouped into an ensemble, where the classifiers operate in a collaborative environment [
46]. Ensemble classifiers are methods which aggregate the predictions of a number of diverse base classifiers in order to produce a more accurate predictor, with the idea being that “many heads think better than one”. The ability to combine the outputs of multiple classifiers to improve upon the individual accuracy of each one has prompted much research and innovative ensemble construction proposals, such as bagging [
47] and boosting [
48]. Ensemble models have positioned themselves as valuable tools in pattern classification, routinely achieving excellent results on complex tasks. In this paper, we selected a boosting ensemble, the AdaBoost algorithm [
49], using C4.5 as base classifier.
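A rough scikit-learn sketch of a boosting ensemble of this kind is shown below; note that scikit-learn's default base learner is a CART decision stump rather than the C4.5 (J48) trees used with WEKA in this paper, so this is only an approximation of the experimental setting:

```python
# Illustrative sketch (assumed scikit-learn setup, not the WEKA configuration of the paper):
# an AdaBoost ensemble whose default base classifier is a depth-1 CART decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # a readily available two-class dataset
ada = AdaBoostClassifier(n_estimators=50)
print(cross_val_score(ada, X, y, cv=5, scoring="balanced_accuracy").mean())
```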
2.2. Lernmatrix: The Original Model
The Lernmatrix is an associative memory [
13]. Therefore, since an associative memory performs the task of pattern recall in pattern recognition, in principle the Lernmatrix receives patterns at the input and delivers patterns at the output. However, if the output patterns are chosen properly, the Lernmatrix can act as a pattern classifier. The Lernmatrix is an input-output system that accepts a binary pattern at the input. If there is no saturation, the Lernmatrix generates a one-hot pattern at the output. The saturation phenomenon will be amply illustrated in this subsection. A schematic diagram of the original Lernmatrix model is shown in
Figure 9.
If M is a Lernmatrix and $\mathbf{x}^1$ is an input pattern, an example of a 5-dimensional input pattern is:
$$ \mathbf{x}^1 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \end{bmatrix}^T \qquad (7) $$
In Expression (7), the superscript 1 indicates that this is the first input pattern. In general, if μ is a positive integer, the notation $\mathbf{x}^\mu$ indicates the μ-th input pattern. The j-th component of a pattern is denoted $x_j^\mu$.
The key to a Lernmatrix acting as a pattern classifier is found in the representation of the output patterns as one-hot vectors. It is assumed that in a pattern classification problem there are p different classes, where p is a positive integer greater than 1. For the particular case in which p = 3, class 1 is represented by this one-hot vector:
$$ \mathbf{y}^1 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}^T \qquad (8) $$
while the representations of classes 2 and 3 are:
$$ \mathbf{y}^2 = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}^T, \qquad \mathbf{y}^3 = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}^T \qquad (9) $$
In general, to represent the class k, with $k \in \{1, 2, \ldots, p\}$, the following values are assigned to the output binary pattern $\mathbf{y}$: $y_k = 1$, and for every $i \neq k$, the value $y_i = 0$ is assigned.
The expressions for the learning and recalling phases were adapted from two articles published by Steinbuch: the 1961 article where he released his original model [
13], and an article he published in 1965, co-authored with Widrow [
50], who is the creator of one of the first neuronal models called ADALINE.
Learning phase for the Lernmatrix.
To start the learning phase of a Lernmatrix with p classes and n-dimensional input binary patterns, a $p \times n$ matrix $M = [m_{ij}]$ is created, with
$$ m_{ij} = 0, \quad \forall i \in \{1, \ldots, p\}, \; \forall j \in \{1, \ldots, n\} \qquad (10) $$
For each input pattern $\mathbf{x}^\mu$ and its corresponding output pattern $\mathbf{y}^\mu$, each component $m_{ij}$ is updated according to the following rule:
$$ m_{ij} \leftarrow m_{ij} + \Delta m_{ij}, \qquad \Delta m_{ij} = \begin{cases} +\varepsilon & \text{if } y_i^\mu = 1 \text{ and } x_j^\mu = 1 \\ -\varepsilon & \text{if } y_i^\mu = 1 \text{ and } x_j^\mu = 0 \\ 0 & \text{if } y_i^\mu = 0 \end{cases} \qquad (11) $$
with $\varepsilon > 0$.
Remark 1. The only restriction for the value of ε is that it be positive. Therefore, it is valid to choose the value of ε as 1. In all the examples in this paper, we will use the value ε = 1.
Example 1. Execute the learning phase of a Lernmatrix that has 3 classes and 5-dimensional input patterns (one input pattern for each class):
$$ \mathbf{x}^1 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \end{bmatrix}^T, \quad \mathbf{x}^2 = \begin{bmatrix} 1 & 1 & 0 & 0 & 1 \end{bmatrix}^T, \quad \mathbf{x}^3 = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 \end{bmatrix}^T \qquad (12) $$
associated with the class patterns $\mathbf{y}^1$, $\mathbf{y}^2$, and $\mathbf{y}^3$, respectively. Initially, a $3 \times 5$ matrix is created with all its entries set to zero. Then, the transpose of the first input pattern is placed on top, and the first class pattern to the left of the matrix. Then, the learning rule of Equation (11) is applied to each of the components of both patterns:
| | 1 | 0 | 1 | 0 | 1 |
| 1 | 0+1 | 0−1 | 0+1 | 0−1 | 0+1 |
| 0 | 0+0 | 0+0 | 0+0 | 0+0 | 0+0 |
| 0 | 0+0 | 0+0 | 0+0 | 0+0 | 0+0 |
The Lernmatrix looks like this after learning the first pattern association:
| 1 | −1 | 1 | −1 | 1 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
Now, the learning rule of Equation (11) is applied to each of the components of both patterns of the second association, with the transpose of $\mathbf{x}^2$ on top and $\mathbf{y}^2$ to the left:
| | 1 | 1 | 0 | 0 | 1 |
| 0 | 1 | −1 | 1 | −1 | 1 |
| 1 | 1 | 1 | −1 | −1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 |
When applying the learning rule of Equation (11) to each of the components of both patterns of the third association, the matrix becomes:
| | 1 | 0 | 1 | 1 | 0 |
| 0 | 1 | −1 | 1 | −1 | 1 |
| 0 | 1 | 1 | −1 | −1 | 1 |
| 1 | 1 | −1 | 1 | 1 | −1 |
Since the rule of Equation (11) has already been applied to all pattern associations, the learning phase has concluded, and finally the Lernmatrix is:
$$ M = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 \end{bmatrix} \qquad (13) $$
Recalling phase for the Lernmatrix.
If $\tilde{\mathbf{x}}$ is an n-dimensional input pattern whose class is unknown, the recalling phase consists of operating the Lernmatrix with that input pattern, trying to find the corresponding p-dimensional one-hot vector $\mathbf{y}$ (i.e., the class). The i-th coordinate $y_i$ is obtained according to the next expression, where ∨ is the maximum operator:
$$ y_i = \begin{cases} 1 & \text{if } \displaystyle\sum_{j=1}^{n} m_{ij}\, \tilde{x}_j = \bigvee_{h=1}^{p} \left[ \sum_{j=1}^{n} m_{hj}\, \tilde{x}_j \right] \\ 0 & \text{otherwise} \end{cases} \qquad (14) $$
Example 2. Now we are going to apply Equation (14) to each of the input patterns of Expression (12), with the Lernmatrix of Expression (13). For i = 1 we have $\sum_j m_{1j}\, x_j^1 = 3$; for i = 2: $\sum_j m_{2j}\, x_j^1 = 1$; and for i = 3: $\sum_j m_{3j}\, x_j^1 = 1$. According to Equation (14), the value 1 will be assigned to the coordinate that gives the greatest sum, and 0 to all the others. This means that input pattern $\mathbf{x}^1$ will be assigned the class vector $\mathbf{y}^1$, which is correct according to Expression (12). When doing the same with input vectors $\mathbf{x}^2$ and $\mathbf{x}^3$, we obtain the sums (1, 3, −1) and (1, −1, 3), respectively, so they are assigned the class vectors $\mathbf{y}^2$ and $\mathbf{y}^3$, which is also correct.
Example 3. In Example 2, all input patterns were correctly assigned the class in the recalling phase. That is true, but an interesting question arises: what will happen to the recalling phase if there are more input patterns than classes? To find out the answer, we are going to add a new input pattern of class 1 to the Lernmatrix of Expression (13):
$$ \mathbf{x}^4 = \begin{bmatrix} 0 & 1 & 0 & 1 & 1 \end{bmatrix}^T \qquad (15) $$
When applying the learning rule of Equation (11) to each of the components of both patterns of the association $(\mathbf{x}^4, \mathbf{y}^1)$, the matrix becomes:
| | 0 | 1 | 0 | 1 | 1 |
| 1 | 1−1 | −1+1 | 1−1 | −1+1 | 1+1 |
| 0 | 1 | 1 | −1 | −1 | 1 |
| 0 | 1 | −1 | 1 | 1 | −1 |
That is, the Lernmatrix is now:
$$ M = \begin{bmatrix} 0 & 0 & 0 & 0 & 2 \\ 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 \end{bmatrix} \qquad (16) $$
It can be easily verified that if we apply Equation (14) to each of the input patterns of Expression (12) with the Lernmatrix of Expression (16), all three classes are correctly assigned by the Lernmatrix. What will happen to input pattern $\mathbf{x}^4$? Its recall sums are (2, 1, −1), so it is assigned the class vector $\mathbf{y}^1$, which is also correct. Again, all input patterns were correctly assigned the class in the recalling phase. Another interesting question arises: will the Lernmatrix correctly assign classes to all patterns every time a new input pattern is added? The answer is no. Example 4 will illustrate a disadvantage exhibited by the Lernmatrix: a phenomenon called saturation.
Example 4. We are going to add a new input pattern of class 3 to the Lernmatrix of Expression (16):
$$ \mathbf{x}^5 = \begin{bmatrix} 0 & 0 & 1 & 0 & 1 \end{bmatrix}^T $$
When applying the learning rule of Equation (11) to each of the components of both patterns of the association $(\mathbf{x}^5, \mathbf{y}^3)$, the matrix becomes:
| | 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 2 |
| 0 | 1 | 1 | −1 | −1 | 1 |
| 1 | 1−1 | −1−1 | 1+1 | 1−1 | −1+1 |
That is, the Lernmatrix is now:
$$ M = \begin{bmatrix} 0 & 0 & 0 & 0 & 2 \\ 1 & 1 & -1 & -1 & 1 \\ 0 & -2 & 2 & 0 & 0 \end{bmatrix} \qquad (17) $$
It can be easily verified that if we apply Equation (14) to each of the input patterns $\mathbf{x}^2$, $\mathbf{x}^3$, and $\mathbf{x}^4$ with the Lernmatrix of Expression (17), all three classes are correctly assigned by the Lernmatrix. Now we are going to verify what happens with input pattern $\mathbf{x}^5$: its recall sums are (2, 0, 2), so the output pattern is $[1 \; 0 \; 1]^T$.
Here is the Lernmatrix saturation phenomenon: the output pattern is not one-hot, so there is ambiguity. What class should we assign to input pattern $\mathbf{x}^5$? Should we assign it class 1 or class 3?
The reader can easily verify that something similar occurs with input pattern $\mathbf{x}^1$, where the saturation phenomenon also appears and, consequently, there is ambiguity.
It is worth noting something that happened in the previous 4 examples. The Lernmatrix learned with 5 input patterns, and each of those same input patterns was used as a testing pattern. This is contrary to the concepts illustrated in Figure 1, regarding the partition of dataset D into two sets, one set L for learning and the other set T for testing. However, in the four previous examples, the following occurs: D = L = T. This strange procedure exists and has a technical name: it is called resubstitution error [51]. It is not an authentic validation method, and it is only used when we want to know the trend of a new classifier. In the 4 examples mentioned, we have used 5 patterns out of the 32 available, since with 5 bits it is possible to form $2^5 = 32$ different binary patterns.
We will take advantage of the results of the 4 previous examples in order to exemplify the behavior of the Lernmatrix when there is a dataset D which is partitioned into L and T, as illustrated in Figure 1. It must be taken into account that, due to its nature as an associative memory, in the case of the Lernmatrix the dataset D is made up of associations of two patterns: an input pattern and an output pattern, which represents the class. The same goes for L and T.
Example 5. Specifically, we will assume that dataset D contains 8 associations, of which the 5 from the previous 4 examples form the learning set $L = \{(\mathbf{x}^1, \mathbf{y}^1), (\mathbf{x}^2, \mathbf{y}^2), (\mathbf{x}^3, \mathbf{y}^3), (\mathbf{x}^4, \mathbf{y}^1), (\mathbf{x}^5, \mathbf{y}^3)\}$, and there are 3 more associations that form the testing set T, whose input patterns are $\mathbf{x}^6$, $\mathbf{x}^7$, and $\mathbf{x}^8$. After the learning phase, where Expressions (10) and (11) were applied to the 5 patterns in set L, the Lernmatrix is precisely the matrix of Expression (17). During the recalling phase, the operations of Expression (14) are performed with that Lernmatrix and each one of the testing patterns $\mathbf{x}^6$, $\mathbf{x}^7$, and $\mathbf{x}^8$.
Two of the three patterns of the testing set T were correctly classified, and saturation appears in the remaining testing pattern. If we use Equation (2) to calculate accuracy, we have C = 2 and N = 3, so accuracy ≈ 66.7%.
Now that we have illustrated both phases of the Lernmatrix with very simple examples, it is worth investigating how good the Lernmatrix is as a pattern classifier on real datasets. Also, since most real-life datasets do not contain only binary data, we must be clear about what we must do to apply the Lernmatrix to those datasets.
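Before moving on to real datasets, the following compact Python sketch (our own illustration, not part of the original model's description) implements the learning rule of Equation (11) with ε = 1 and the recalling rule of Equation (14), and reproduces the behavior observed in Examples 1–4, including the saturation of patterns $\mathbf{x}^1$ and $\mathbf{x}^5$:

```python
# Sketch of the original Lernmatrix: learning rule of Equation (11) with epsilon = 1,
# recalling rule of Equation (14), applied to the patterns of Examples 1-4.
import numpy as np

def learn(X, classes, p):
    """X: binary patterns (one per row); classes: 0-based class index of each pattern."""
    M = np.zeros((p, X.shape[1]), dtype=int)
    for x, c in zip(X, classes):
        M[c] += np.where(x == 1, 1, -1)   # +1 if x_j = 1, -1 if x_j = 0; other rows unchanged
    return M

def recall(M, x):
    sums = M @ x
    return (sums == sums.max()).astype(int)   # 1 at every row reaching the maximum sum

X = np.array([[1, 0, 1, 0, 1],    # x1, class 1
              [1, 1, 0, 0, 1],    # x2, class 2
              [1, 0, 1, 1, 0],    # x3, class 3
              [0, 1, 0, 1, 1],    # x4, class 1 (Example 3)
              [0, 0, 1, 0, 1]])   # x5, class 3 (Example 4)
classes = [0, 1, 2, 0, 2]

M = learn(X, classes, p=3)
for x in X:
    print(recall(M, x))   # x1 and x5 yield [1, 0, 1]: saturation/ambiguity between classes 1 and 3
```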
To illustrate how good the Lernmatrix is as a pattern classifier on real datasets, we will take the first dataset that we have included in the experimental section of this paper as an example. This is the ecoli-0_vs_1 dataset, which contains 220 patterns of 7 numerical attributes, and which can be downloaded from the website: http://www.keel.es/. The objective of this problem is to predict the localization site of proteins by employing some measures about the cell; the two classes are cytoplasm (class 0) and inner membrane (class 1). To apply the Lernmatrix, we must binarize the patterns. Since each pattern has 7 numerical attributes and each attribute requires 8 bits to represent it, each pattern in the dataset consists of 56 binary attributes. As previously discussed, we decided to use the stratified 5-fold cross-validation method, and balanced accuracy as a performance measure.
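The paper does not detail the binarization step at this point, so the following sketch shows one plausible scheme that is consistent with the stated sizes: min-max scale each numerical attribute to an integer in [0, 255] and expand it into 8 bits, turning 7 numerical attributes into 56 binary ones:

```python
# One plausible binarization scheme (an assumption; the encoding is not specified here):
# min-max scale each attribute to an integer in [0, 255] and unpack it into 8 bits.
import numpy as np

def binarize(X, n_bits=8):
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)
    levels = np.round((X - mn) / span * (2 ** n_bits - 1)).astype(np.uint8)
    return np.unpackbits(levels, axis=1)   # most significant bit first

X_num = np.array([[0.49, 0.29, 0.48, 0.50, 0.56, 0.24, 0.35],   # two hypothetical
                  [0.07, 0.40, 0.48, 0.50, 0.54, 0.35, 0.44]])  # 7-attribute patterns
print(binarize(X_num).shape)   # (2, 56): 7 attributes x 8 bits each
```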
After each of the 5 learning phases, a 56 × 2 Lernmatrix resulted. Table 1 includes the performance of the Lernmatrix and also presents the performance values of 7 state-of-the-art classifiers (the complete table is included in Section 5). As can be seen, the performance of the Lernmatrix on this dataset is very far from the values achieved by the most important state-of-the-art classifiers. However, the main contribution of this paper is the proposal of a novel and simple mathematical transform, which will make it possible to significantly increase the performance of the Lernmatrix, to the point of making this ancient and beautiful model a competitive classifier against the most relevant classifiers in the state of the art.
In the next subsection, we will explain the reasons why the authors of this paper have decided to try to rescue the Lernmatrix, despite the modest performance results in Table 1.
2.4. Lernmatrix: Theoretical Advances
One of the first research actions that the Alpha-Beta group undertook when the researchers decided to work with the Lernmatrix was to create a new framework. This new framework rests on two initial definitions.
These two initial definitions served as decisive support to facilitate the statement and demonstration of the lemmas and theorems that make up the theoretical foundation of the Lernmatrix [
71].
Definition 1. A Steinbuch function is any function $f: \mathbb{R} \rightarrow \mathbb{R}$ with the property: $f(0) = -1$ and $f(1) = 1$.
Example 6. The function $f(x) = 2x - 1$ is a Steinbuch function, because $f(0) = -1$ and $f(1) = 1$, according to Definition 1.
Definition 2. Let f be a Steinbuch function. A Steinbuch vectorial function for f is any function $F: \{0,1\}^n \rightarrow \mathbb{R}^n$ with the following property: $F(\mathbf{x}) = [f(x_1), f(x_2), \ldots, f(x_n)]^T$.
In this paper, we have described the Lernmatrix as a pattern classifier, where its recalling phase is actually a classification phase in the supervised pattern classification paradigm. For this reason, we take into account what is exposed in Figure 1. When designing a Lernmatrix, we assume that there is a dataset D, from which a partition is made into two subsets: L (learning or training) and T (testing). The Lernmatrix learns with L, and the patterns in T are used for testing. In the case of the Lernmatrix, the set L is made up of associations of two patterns: an input pattern and an output pattern, which represents the class.
If we denote by m the number of associations that the learning set L contains, we can represent it with the following expression:
$$ L = \{ (\mathbf{x}^\mu, \mathbf{y}^\mu) \mid \mu = 1, 2, \ldots, m \}, \qquad \mathbf{x}^\mu \in \{0,1\}^n, \; \mathbf{y}^\mu \in \{0,1\}^p \text{ one-hot} \qquad (20) $$
Learning phase for the Lernmatrix in the new framework.
Let L be a learning set as in Expression 20, let f be a Steinbuch function, and let F be a Steinbuch vectorial function for f. The Lernmatrix M is built according to the next rule:
$$ M = \varepsilon \sum_{\mu=1}^{m} \mathbf{y}^\mu \cdot \left[ F(\mathbf{x}^\mu) \right]^T \qquad (21) $$
As previously discussed, the only restriction for the value of ε is that it be positive. In the spirit of simplifying expressions, henceforth we will assume that ε = 1, so that:
$$ M = \sum_{\mu=1}^{m} \mathbf{y}^\mu \cdot \left[ F(\mathbf{x}^\mu) \right]^T, \qquad \text{that is,} \qquad m_{ij} = \sum_{\mu=1}^{m} y_i^\mu \, f(x_j^\mu) \qquad (22) $$
It can easily be verified that Expression 22 is equivalent to Expressions 10 and 11, which define the learning phase of the original Lernmatrix model. To illustrate this equivalence, we will replicate Example 1 (Expression 12) in the new framework.
Example 7. Making ε = 1 in Expression 21, the first association in Example 1 generates this:
$$ \mathbf{y}^1 \cdot \left[ F(\mathbf{x}^1) \right]^T = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} $$
Performing similar operations for the other two associations in Example 1, and applying Expression 22:
$$ M = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 \end{bmatrix} $$
Note that this result coincides with Expression 13. We can perform the same operations for the fourth association $(\mathbf{x}^4, \mathbf{y}^1)$ to obtain the Lernmatrix of Expressions 15 and 16, and the reader can easily verify that, after adding the fifth association $(\mathbf{x}^5, \mathbf{y}^3)$, the result coincides with Expression 17.
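Assuming the outer-product form of Expression 22 written above, the following small check verifies that summing $\mathbf{y}^\mu \cdot [F(\mathbf{x}^\mu)]^T$ over the three associations of Example 1 reproduces the Lernmatrix of Expression 13:

```python
# Small check (under the outer-product reading of Expression 22 assumed above):
# M = sum over mu of y^mu * F(x^mu)^T, with F applying f(1) = 1, f(0) = -1 component-wise.
import numpy as np

X = np.array([[1, 0, 1, 0, 1], [1, 1, 0, 0, 1], [1, 0, 1, 1, 0]])
Y = np.eye(3, dtype=int)              # one-hot class patterns y^1, y^2, y^3
F = np.where(X == 1, 1, -1)           # Steinbuch vectorial function applied to each x^mu
M = sum(np.outer(y, fx) for y, fx in zip(Y, F))
print(M)   # [[1 -1 1 -1 1], [1 1 -1 -1 1], [1 -1 1 1 -1]], i.e., Expression 13
```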
Recalling phase for the Lernmatrix in the new framework.
Let M be a Lernmatrix built using Expression 22, and let $(\mathbf{x}^\omega, \mathbf{y}^\omega)$ be an association of patterns with attributes and dimensions as in Expression 20. The output pattern $\mathbf{z}$ is obtained by operating the Lernmatrix M and the pattern $\mathbf{x}^\omega$, according to the operations specified by Equation (23):
$$ z_i = \begin{cases} 1 & \text{if } \displaystyle\sum_{j=1}^{n} m_{ij}\, x_j^\omega = \bigvee_{h=1}^{p} \left[ \sum_{j=1}^{n} m_{hj}\, x_j^\omega \right] \\ 0 & \text{otherwise} \end{cases} \qquad (23) $$
If there is saturation, the output pattern $\mathbf{z}$ is not necessarily equal to $\mathbf{y}^\omega$.
Definition 3. If $\mathbf{z} = \mathbf{y}^\omega$, then the recalling is correct.
Example 8. It can easily be verified that Expression 23 is equivalent to Expression 14, both defining the recalling phase of the original Lernmatrix model. To illustrate this equivalence, we will replicate the first case of Example 2 in the new framework: for the pattern $\mathbf{x}^1$ and the Lernmatrix of Expression 13, the sums of Equation (23) are 3, 1, and 1, so $\mathbf{z} = [1\;0\;0]^T = \mathbf{y}^1$. Something similar occurs with the remaining cases in Example 2, and with all the cases in Examples 3, 4 and 5.
According to [68], the original model of the Lernmatrix suffers from a big problem: saturation, which has been one of the enemies to be defeated by the Alpha-Beta group since 2001. Saturation, as illustrated in Example 4, can be regarded as the overtraining of a memory, to the extent that it is no longer possible to correctly remember or recall the patterns learned. In other words, the class obtained is not one of those established in the learning set L. This fact, visualized in the recalling phase, generates another problem known as ambiguity. This problem is the inability to determine the class associated with a certain input pattern, because the Lernmatrix output pattern is not a one-hot pattern. Ambiguity can be caused by two reasons that have been clearly identified by the Alpha-Beta group: (1) memory saturation, or (2) the structure inherent to the patterns belonging to dataset D.
Ambiguity is another of the enemies to be defeated by the Alpha-Beta group since 2001. Over almost two decades of research, we have proposed several partial solutions to these two problems. Before presenting some of the partial solutions that the Alpha-Beta group has published, it is first necessary to define the alteration between patterns and the corresponding notation.
Definition 4. Two n-dimensional binary patterns $\mathbf{w}$ and $\mathbf{z}$ are equal (denoted by $\mathbf{w} = \mathbf{z}$) if and only if: $w_j = z_j, \; \forall j \in \{1, \ldots, n\}$.
Example 9. In Example 8, the two 3-dimensional binary patterns $\mathbf{z}$ and $\mathbf{y}^1$ are equal, because they satisfy the condition of Definition 4. However, there are patterns in Example 5 that are not equal because, for j = 1, one of them has the value 1 while the other has the value 0.
Definition 5. Let $\mathbf{w}$ and $\mathbf{z}$ be two n-dimensional binary patterns. The pattern $\mathbf{w}$ is less than or equal to pattern $\mathbf{z}$ (denoted by $\mathbf{w} \le \mathbf{z}$) if and only if: whenever $w_j = 1$, it is also true that $z_j = 1$, $\forall j \in \{1, \ldots, n\}$.
Example 10. One of the testing patterns of Example 5, whose only components equal to 1 occur at the indices j = 1 and j = 2, is less than or equal to pattern $\mathbf{x}^2$ of Example 1: every time that testing pattern has a component equal to 1 (at j = 1 and j = 2), it is also true that the corresponding component of $\mathbf{x}^2$ equals 1, according to Definition 5.
Example 11. When considering another testing pattern of Example 5 and a pattern of Example 1, the corresponding "less than or equal to" expression is false according to Definition 5. The reason is that there is an index j at which the former has the value 1 but the latter has the value 0, so the implication of Definition 5 fails at that index.
Definition 6. Let $\mathbf{w}$ and $\mathbf{z}$ be two n-dimensional binary patterns such that $\mathbf{w} \le \mathbf{z}$ according to Definition 5. If $\exists j$ such that $w_j = 0$ and $z_j = 1$, then the pattern $\mathbf{w}$ exhibits subtractive alteration with respect to the pattern $\mathbf{z}$. This is denoted by $\mathbf{w} \prec \mathbf{z}$.
Example 12. From Example 10, that testing pattern is less than or equal to $\mathbf{x}^2$. Also, it exhibits subtractive alteration with respect to $\mathbf{x}^2$, because there exists an index j at which it has the value 0 while $\mathbf{x}^2$ has the value 1, according to Definition 6. Thus, the relation $\prec$ of Definition 6 holds between the two patterns.
Below is a brief anthology of the most important advances obtained by the Alpha-Beta group in relation to the theoretical foundations of the Lernmatrix. These results serve as a solid basis for the proposal of the mathematical transform, which represents the main contribution of this paper. In turn, this novel and simple mathematical transform is the cornerstone of the methodology that gives rise to the new Lernmatrix, a model that is now competitive against the most important classifiers of the state of the art. All these results can be consulted in these papers: [68,69,70,71,73].
Definition 7. Let $\mathbf{x}$ be an n-dimensional binary pattern. The characteristic set of $\mathbf{x}$ is defined by $C_{\mathbf{x}} = \{ j \in \{1, \ldots, n\} \mid x_j = 1 \}$. The cardinality $|C_{\mathbf{x}}|$ of the characteristic set is the number of ones in the pattern $\mathbf{x}$.
Example 13. From Example 1, it is possible to verify the following expressions: $C_{\mathbf{x}^1} = \{1, 3, 5\}$ and $|C_{\mathbf{x}^1}| = 3$; $C_{\mathbf{x}^2} = \{1, 2, 5\}$ and $|C_{\mathbf{x}^2}| = 3$; $C_{\mathbf{x}^3} = \{1, 3, 4\}$ and $|C_{\mathbf{x}^3}| = 3$. The characteristic sets and cardinalities of the patterns of Example 5 are obtained in the same way.
Lemma 1. Let $\mathbf{w}$ and $\mathbf{z}$ be two n-dimensional binary patterns. Then $\mathbf{w} \prec \mathbf{z}$ if and only if $C_{\mathbf{w}} \subset C_{\mathbf{z}}$.
Proof. By Definition 6, $\mathbf{w} \prec \mathbf{z}$ if and only if $\mathbf{w} \le \mathbf{z}$ and $\exists j$ such that $w_j = 0$ and $z_j = 1$; if and only if (Definition 5) $w_j = 1 \Rightarrow z_j = 1$ for all j, and $\exists j$ with $w_j = 0$ and $z_j = 1$; if and only if (Definition 7) $C_{\mathbf{w}} \subseteq C_{\mathbf{z}}$ and $\exists j \in C_{\mathbf{z}}$ with $j \notin C_{\mathbf{w}}$; if and only if $C_{\mathbf{w}} \subseteq C_{\mathbf{z}}$ and $C_{\mathbf{w}} \neq C_{\mathbf{z}}$; if and only if $C_{\mathbf{w}} \subset C_{\mathbf{z}}$. □
Remark 2. Hereinafter, the symbol □ will indicate the end of a proof.
The importance of Lemma 1 lies in showing that an order relationship between patterns implies an order relationship between their characteristic sets and vice versa.
Example 14. From Examples 10 and 12, the testing pattern considered there exhibits subtractive alteration with respect to $\mathbf{x}^2$; and from Example 13 and Lemma 1, its characteristic set is properly contained in $C_{\mathbf{x}^2}$.
Lemma 2. Let M be a Lernmatrix which was built from the learning set $L = \{(\mathbf{x}^\mu, \mathbf{y}^\mu) \mid \mu = 1, \ldots, m\}$ by applying Expression 22, fulfilling $\mathbf{y}^\mu = \mathbf{y}^\nu$ if and only if $\mu = \nu$. Let $\tilde{\mathbf{x}}$ be an n-dimensional binary pattern and $k \in \{1, \ldots, m\}$. Then the k-th component of $M \cdot \tilde{\mathbf{x}}$ is obtained by the following expression: $\sum_{j=1}^{n} m_{kj}\, \tilde{x}_j = 2\,|C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| - |C_{\tilde{\mathbf{x}}}|$.
Proof. By Expression 22, $m_{ij} = \sum_{\mu=1}^{m} y_i^\mu f(x_j^\mu)$. Since the patterns $\mathbf{y}^\mu$ are one-hot and all different, we may assume (relabeling if necessary) that $\mathbf{y}^\mu$ has its 1 in position μ, so the Lernmatrix M can be written row by row as $m_{\mu j} = f(x_j^\mu)$. By performing the summation, the k-th component of $M \cdot \tilde{\mathbf{x}}$ is:
$$ \sum_{j=1}^{n} m_{kj}\, \tilde{x}_j = \sum_{j=1}^{n} f(x_j^k)\, \tilde{x}_j $$
Only the components of $\tilde{\mathbf{x}}$ that are equal to 1 contribute to the sum, and by Definition 7 we have:
$$ \sum_{j=1}^{n} m_{kj}\, \tilde{x}_j = \sum_{j \in C_{\tilde{\mathbf{x}}}} f(x_j^k) \qquad (30) $$
This summation has exactly $|C_{\tilde{\mathbf{x}}}|$ terms, which may be 1 or −1 according to Definition 1.
If we know the number of terms with value 1 and the number of terms with value −1, the result of the summation (30) is calculated using elementary algebra. If we denote by $a$ the number of terms with value 1, and by $b$ the number of terms with value −1, the result of the summation (30) is:
$$ a - b \qquad (31) $$
Since the summation has exactly $|C_{\tilde{\mathbf{x}}}|$ terms, the number of terms with value −1 is:
$$ b = |C_{\tilde{\mathbf{x}}}| - a \qquad (32) $$
So Expression 31 becomes:
$$ 2a - |C_{\tilde{\mathbf{x}}}| \qquad (33) $$
We are going to analyze the application of Definitions 1 and 7 in Expression 30, in order to elucidate the meaning of the quantity $a$. According to Definition 1, positions with value 1 in pattern $\mathbf{x}^k$ remain with value 1 in $F(\mathbf{x}^k)$, because f is a Steinbuch function. That is, the set of positions with value 1 in $F(\mathbf{x}^k)$ is equal to $C_{\mathbf{x}^k}$, according to Definition 7. However, according to Expression 30, $a$ does not count all the values 1 of $F(\mathbf{x}^k)$, but is restricted to the positions of the characteristic set $C_{\tilde{\mathbf{x}}}$, where the pattern $\tilde{\mathbf{x}}$ has value 1 according to Definition 7. Thus, $a$ is equal to the cardinality of the intersection of both characteristic sets:
$$ a = |C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| \qquad (34) $$
So, by substituting (34) in Expression 33, we obtain the thesis: $\sum_{j=1}^{n} m_{kj}\, \tilde{x}_j = 2\,|C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| - |C_{\tilde{\mathbf{x}}}|$. □
Example 15. When operating the Lernmatrix of Expression 13 with the second pattern of the learning set L, we have (Example 2) $\tilde{\mathbf{x}} = \mathbf{x}^2$, $C_{\mathbf{x}^2} = \{1, 2, 5\}$, and $|C_{\mathbf{x}^2}| = 3$.
For k = 1, $|C_{\mathbf{x}^1} \cap C_{\mathbf{x}^2}| = |\{1, 5\}| = 2$, and $2(2) - 3 = 1$.
For k = 2, $|C_{\mathbf{x}^2} \cap C_{\mathbf{x}^2}| = 3$, and $2(3) - 3 = 3$.
For k = 3, $|C_{\mathbf{x}^3} \cap C_{\mathbf{x}^2}| = |\{1\}| = 1$, and $2(1) - 3 = -1$.
These values coincide with the sums obtained in Example 2.
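A quick numerical check of the formula of Lemma 2, using the characteristic sets of Example 15 (a small sketch for verification only):

```python
# Check of Lemma 2 on the characteristic sets of Example 15:
# the k-th recall sum equals 2*|C_{x^k} & C_x| - |C_x|.
c1, c2, c3 = {1, 3, 5}, {1, 2, 5}, {1, 3, 4}   # characteristic sets of x^1, x^2, x^3
cx = c2                                        # test pattern of Example 15 is x^2
for k, ck in enumerate([c1, c2, c3], start=1):
    print(k, 2 * len(ck & cx) - len(cx))       # 1 -> 1, 2 -> 3, 3 -> -1, matching Example 2
```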
Example 16. If we wanted to verify that Lemma 2 is fulfilled in the results of Example 3, it would not be possible, because the hypothesis is violated. In Example 3, there are two equal output patterns whose indices are different: the first and fourth associations both have the class pattern $\mathbf{y}^1$, so the condition $\mathbf{y}^\mu = \mathbf{y}^\nu \Leftrightarrow \mu = \nu$ does not hold.
Remark 3. By applying the methodology proposed in this paper, this problem disappears because in the associations of the learning set L all the output patterns are different. We have used this successful idea previously in [61].
Example 17. A solution to the problem exposed in Example 16 is to modify the Lernmatrix of Expression 13, creating a new Lernmatrix by adding pattern $\mathbf{x}^4$. But its output pattern would no longer be $\mathbf{y}^1$, but a four-bit one-hot pattern, modifying the three output patterns of $\mathbf{x}^1$, $\mathbf{x}^2$, and $\mathbf{x}^3$ to four bits as well. The new Lernmatrix becomes a $4 \times 5$ matrix:
$$ M = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 \\ -1 & 1 & -1 & 1 & 1 \end{bmatrix} $$
and now we can verify that Lemma 2 is fulfilled, with $\tilde{\mathbf{x}} = \mathbf{x}^4$, $C_{\mathbf{x}^4} = \{2, 4, 5\}$ and $|C_{\mathbf{x}^4}| = 3$:
For k = 1, $|C_{\mathbf{x}^1} \cap C_{\mathbf{x}^4}| = 1$, and $2(1) - 3 = -1$.
For k = 2, $|C_{\mathbf{x}^2} \cap C_{\mathbf{x}^4}| = 2$, and $2(2) - 3 = 1$.
For k = 3, $|C_{\mathbf{x}^3} \cap C_{\mathbf{x}^4}| = 1$, and $2(1) - 3 = -1$.
For k = 4, $|C_{\mathbf{x}^4} \cap C_{\mathbf{x}^4}| = 3$, and $2(3) - 3 = 3$.
Lemma 3. Let M be a Lernmatrix which was built from the learning set $L = \{(\mathbf{x}^\mu, \mathbf{y}^\mu) \mid \mu = 1, \ldots, m\}$ by applying Expression 22, fulfilling $\mathbf{y}^\mu = \mathbf{y}^\nu$ if and only if $\mu = \nu$. Let $\tilde{\mathbf{x}}$ be an n-dimensional binary pattern. Then $\mathbf{y}^k$ is the output pattern (for $\tilde{\mathbf{x}}$) of the Lernmatrix recalling phase if and only if $|C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}|$, $\forall i \neq k$, $i \in \{1, \ldots, m\}$.
Proof. Since $\mathbf{y}^k$ is a one-hot pattern, its components fulfill the following condition:
$$ y_k^k = 1 \quad \text{and} \quad y_i^k = 0, \; \forall i \neq k \qquad (35) $$
By Expressions 23 and 35, $\mathbf{y}^k$ is the output pattern for $\tilde{\mathbf{x}}$ if and only if:
$$ \sum_{j=1}^{n} m_{kj}\, \tilde{x}_j > \sum_{j=1}^{n} m_{ij}\, \tilde{x}_j \qquad (36) $$
This occurs, by Lemma 2, if and only if:
$$ 2\,|C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| - |C_{\tilde{\mathbf{x}}}| > 2\,|C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}| - |C_{\tilde{\mathbf{x}}}| \qquad (37) $$
for any arbitrary index i, such that $i \neq k$,
$i \in \{1, \ldots, m\}$. Adding $|C_{\tilde{\mathbf{x}}}|$ to both sides of (37) and dividing by 2:
$$ |C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}| \qquad (38) $$
So, by substituting (38) in Expression 37, we obtain the thesis:
$$ |C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}|, \quad \forall i \neq k, \; i \in \{1, \ldots, m\} $$
□
Example 18. We will illustrate Lemma 3 with the patterns of the learning set L of Example 1 and the Lernmatrix obtained in Expression 13. We will obtain the output pattern for a pattern that does not belong to L.
This is the learning set L of Example 1: $L = \{(\mathbf{x}^1, \mathbf{y}^1), (\mathbf{x}^2, \mathbf{y}^2), (\mathbf{x}^3, \mathbf{y}^3)\}$. This is the Lernmatrix obtained in Expression 13:
$$ M = \begin{bmatrix} 1 & -1 & 1 & -1 & 1 \\ 1 & 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & 1 & -1 \end{bmatrix} $$
We will choose the pattern $\mathbf{x}^4 = [0\;1\;0\;1\;1]^T$ of Example 3, which does not belong to L, as the input pattern to this Lernmatrix. In this example, k = 2, and we will verify the inequality of Lemma 3 with the values $|C_{\mathbf{x}^4}| = 3$ and $|C_{\mathbf{x}^2} \cap C_{\mathbf{x}^4}| = 2$. Also: $C_{\mathbf{x}^4} = \{2, 4, 5\}$, $C_{\mathbf{x}^1} = \{1, 3, 5\}$, and $C_{\mathbf{x}^3} = \{1, 3, 4\}$.
For i = 1, $|C_{\mathbf{x}^1} \cap C_{\mathbf{x}^4}| = 1$. So that: $2 > 1$.
For i = 3, $|C_{\mathbf{x}^3} \cap C_{\mathbf{x}^4}| = 1$. So that: $2 > 1$.
Therefore, by Lemma 3, the output pattern is $\mathbf{y}^2$.
Theorem 1. Let M be a Lernmatrix which was built from the learning set $L = \{(\mathbf{x}^\mu, \mathbf{y}^\mu) \mid \mu = 1, \ldots, m\}$ by applying Expression 22, fulfilling $\mathbf{y}^\mu = \mathbf{y}^\nu$ if and only if $\mu = \nu$. Let $\tilde{\mathbf{x}}$ be an n-dimensional binary pattern which exhibits subtractive alteration with respect to the pattern $\mathbf{x}^k$, for some $k \in \{1, \ldots, m\}$. Then $\mathbf{y}^k$ is the output pattern (for $\tilde{\mathbf{x}}$) of the Lernmatrix recalling phase if and only if the expression $\tilde{\mathbf{x}} \prec \mathbf{x}^i$ (according to Definition 6) is false, $\forall i \neq k$, $i \in \{1, \ldots, m\}$.
Proof. By Lemma 3, $\mathbf{y}^k$ is the output pattern for $\tilde{\mathbf{x}}$ if and only if:
$$ |C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}|, \quad \forall i \neq k, \; i \in \{1, \ldots, m\} \qquad (39) $$
By Definition 6, given that $\tilde{\mathbf{x}}$ is an n-dimensional binary pattern which exhibits subtractive alteration with respect to the pattern $\mathbf{x}^k$, for some $k \in \{1, \ldots, m\}$, we have $\tilde{\mathbf{x}} \prec \mathbf{x}^k$, and by Lemma 1 this is true if and only if:
$$ C_{\tilde{\mathbf{x}}} \subset C_{\mathbf{x}^k} \qquad (40) $$
By elementary set theory, Expression 40 is true if and only if:
$$ C_{\mathbf{x}^k} \cap C_{\tilde{\mathbf{x}}} = C_{\tilde{\mathbf{x}}} \qquad (41) $$
By substituting Expression 41 in Expression 39, we have:
$$ |C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}|, \quad \forall i \neq k, \; i \in \{1, \ldots, m\} \qquad (42) $$
Expression 42 is equivalent to:
$$ C_{\tilde{\mathbf{x}}} \not\subset C_{\mathbf{x}^i}, \quad \forall i \neq k, \; i \in \{1, \ldots, m\} \qquad (43) $$
By contradiction, we suppose that the negation of proposition 43 is true:
$$ \exists\, i \neq k \text{ such that } C_{\tilde{\mathbf{x}}} \subset C_{\mathbf{x}^i} \qquad (44) $$
This proposition is equivalent to:
$$ \exists\, i \neq k \text{ such that } C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}} = C_{\tilde{\mathbf{x}}} \qquad (45) $$
By Expressions 44, 41, and 39, we would have:
$$ |C_{\tilde{\mathbf{x}}}| > |C_{\mathbf{x}^i} \cap C_{\tilde{\mathbf{x}}}| = |C_{\tilde{\mathbf{x}}}| \qquad (46) $$
which is a contradiction; hence, Expressions 42 and 43 are equivalent. By Lemma 1, Expression 43 is true if and only if:
$$ \text{the expression } \tilde{\mathbf{x}} \prec \mathbf{x}^i \text{ is false}, \quad \forall i \neq k, \; i \in \{1, \ldots, m\} \qquad (47) $$
□
What is expressed in Theorem 1 is very relevant for the state of the art in the topic of associative memories, when these models convert the task of pattern recalling into the task of pattern classification. The relevance lies in that Theorem 1 provides necessary and sufficient conditions for the Lernmatrix to correctly classify a pattern $\tilde{\mathbf{x}}$ that does not belong to the learning set L. The pattern $\tilde{\mathbf{x}}$ belongs to the testing set T and is characterized by exhibiting subtractive alteration with respect to some pattern belonging to the learning set L, say $\mathbf{x}^k$.
However, to achieve the correct classification of the pattern $\tilde{\mathbf{x}}$, the condition included in Theorem 1 is very strong. Theorem 1 requires this condition to be fulfilled: there must be no subtractive alteration of the testing pattern $\tilde{\mathbf{x}}$ with respect to any of the patterns in the learning set L different from $\mathbf{x}^k$.
As mentioned previously, saturation and ambiguity are two problems that the Alpha-Beta group has faced for almost twenty years. It is now obvious that both problems appear as a direct consequence of the strong condition of Theorem 1 not being fulfilled. This is evidenced in the poor results that the Lernmatrix shows in the dataset of
Table 1.
During a recent Alpha-Beta group work session, a disruptive idea suddenly emerged: what would happen if we do something to modify the pattern data so that the patterns fulfill the strong condition of Theorem 1? From that moment on, we worked hard in the search for some transform capable of eliminating the subtractive alteration $\tilde{\mathbf{x}} \prec \mathbf{x}^i$ for all values of i different from k, assuming that the test pattern $\tilde{\mathbf{x}}$ exhibits subtractive alteration with respect to $\mathbf{x}^k$.
This is precisely the achievement made in this research work. Finally we have found the long-awaited transform, which is proposed in
Section 3. With this new representation of the data, the strong condition of Theorem 1 is fulfilled.
Section 5 of this article shows that with this new representation of the data, the performance of the Lernmatrix increases markedly, to the degree of competing with the state-of-the-art supervised classifiers. The methodology proposed in this work includes the new transform as one of its relevant methodological steps.