The conducted experiments investigate the ability of the proposed approach, DDSAE, to dynamically set the depth of a stacked AutoEncoder. Specifically, experiments were conducted on two benchmark datasets and two real datasets. Moreover, the performance of DDSAE was assessed by feeding the AE learned features to a classifier, and then comparing the classification results using accuracy, precision, recall, and F1 score performance measures.
5.1. Dataset Description
MNIST: The MNIST benchmark dataset [35] includes a training set of 60K examples and a testing set of 10K examples. Each instance is a 28 × 28 grayscale image of a single handwritten digit from 0 to 9. The dataset is balanced.
CIFAR-10: The CIFAR-10 benchmark dataset [36] includes 50K training images and 10K testing images. Each instance is a 32 × 32 RGB image showing one of ten object classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). The dataset is balanced.
Parkinson’s: The Parkinson’s disease classification dataset [37,38] consists of 756 instances, each with 754 attributes. The data were collected from 188 patients diagnosed with Parkinson’s disease and 64 healthy individuals. The attributes include the gender and various speech features extracted from recordings of the phonation of the vowel /a/, such as the Pitch Period Entropy (PPE), the Detrended Fluctuation Analysis (DFA), the Recurrence Period Density Entropy (RPDE), the number of pulses (numPulses), the number of period pulses (numPeriodPulses), the mean of period pulses (meanPeriodPulses), and the standard deviation of period pulses (stdDevPeriodPulses). The dataset is imbalanced.
Breast Cancer: The Breast Cancer Coimbra dataset [39,40] includes 116 instances with 10 attributes. The data were collected from 64 patients diagnosed with breast cancer and 52 healthy individuals. All predictors are quantitative and were gathered in a routine blood analysis. The attributes are age, Body Mass Index (BMI), Glucose, Insulin, Homeostatic Model Assessment (HOMA), Leptin, Adiponectin, Resistin, Monocyte Chemoattractant Protein-1 (MCP.1), and the class label. The dataset is imbalanced.
Table 2 reports the details of the considered datasets.
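To make the balanced/imbalanced distinction concrete, the following minimal sketch computes the majority-class share of the two real datasets from the subject counts quoted above. This share is the accuracy that a trivial classifier would reach by always predicting the larger class. The function name is hypothetical, and the use of subject counts rather than per-instance counts is an assumption made for illustration.

```python
def majority_share(class_counts):
    """Return the fraction of samples that belong to the largest class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Subject counts quoted in the dataset descriptions above.
parkinson = {"patient": 188, "healthy": 64}
breast_cancer = {"patient": 64, "healthy": 52}

for name, counts in [("Parkinson's", parkinson), ("Breast Cancer", breast_cancer)]:
    print(f"{name}: majority-class share = {majority_share(counts):.3f}")
```

A share close to 0.5 indicates a nearly balanced binary problem, whereas a larger share means that accuracy alone can look good without the minority class ever being detected.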
5.3. Experiment 1: Manual Depth Investigation
The first experiment manually investigates the performance of the SAE at different depths in order to show how the depth affects the performance and thus demonstrate the importance of learning the depth of SAEs. The depth of the model is varied from one AE to twenty AEs. The experiment is conducted on the two benchmark datasets, MNIST [35] and CIFAR-10 [36]. After training the SAE, a SoftMax layer is appended to the encoder part to perform the classification task. As the two benchmark datasets are balanced, accuracy is used as the performance measure. The accuracy with respect to the network depth is illustrated in Figure 5.
As can be observed in Figure 5a, the best accuracy on MNIST [35] is 87.20%, achieved with the one-AE model. The results show that the classification accuracy decreases as the depth grows: models with depths from 1 to 4 attain better accuracy than deeper models, and the accuracy drops drastically and then plateaus for depths 13 to 20. This counterintuitive result shows that increasing the number of layers can decrease the accuracy.
The performance on the CIFAR-10 [36] dataset is illustrated in Figure 5b. As can be seen, the two-AE architecture attained an accuracy of 25.84% and performs better than the other SAE depths. These results confirm the finding on the MNIST dataset that increasing the depth does not necessarily increase the classification accuracy.
The steep fall in accuracy shown in Figure 5 depends on the considered data: the curve obtained when manually tuning the depth on MNIST [35] (Figure 5a) differs from the one obtained on CIFAR-10 [36] (Figure 5b). The fall is probably due to overfitting when the depth is excessive. As a deep learning model, the SAE is prone to overfitting, because the number of network parameters that must be learned during training increases with the number of layers. As such, it is essential to learn a depth that allows the SAE to learn an abstract representation of the data without overfitting. This motivates the proposed approach, which dynamically learns the depth while training the stacked AutoEncoder.
5.4. Experiment 2: Performance Assessment of the Proposed Approach DDSAE on Benchmark Datasets
The performance of the proposed approach DDSAE in dynamically learning the depth of an SAE in an unsupervised manner is assessed on the two benchmark datasets, MNIST [35] and CIFAR-10 [36]. Initially, the number of AE layers is set to 20. After applying the proposed approach DDSAE, the initial model converged to a one-AE model on both datasets.
Figure 6 displays the number of retained layers (the newly learned depth) at each iteration on both MNIST [35] and CIFAR-10 [36]. The learned depth changes progressively from 20 layers (in the first iteration) to one layer (after training 20 batches). Thus, the topology of the SAE changes dynamically during training, from 20 AEs to one AE. This is consistent with the finding of Experiment 1 that shallow models perform best on these datasets.
The steep fall in the depth depicted in Figure 6 reflects the fast convergence of the algorithm. In fact, after only 20 iterations, the proposed training approach is able to learn the optimal depth of the network.
Figure 7 displays the SAE loss function when training the model on the MNIST [35] and CIFAR-10 [36] datasets. As can be seen, the loss continues to decrease even after the depth converges at iteration 20 (refer to Figure 6). Thus, the stacked AutoEncoder weights are still progressively updated during training. This excludes the possibility of saturation, especially since the employed cross-entropy loss function mitigates the effect of saturated output neurons [41].
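The remark on saturation can be made explicit with the standard single-neuron derivation. The sketch below assumes a sigmoid output neuron with activation a = σ(z), pre-activation z, inputs x_j, weights w_j, bias b, and target y. These symbols are introduced here for illustration and are not taken from the equations of the proposed approach.

```latex
\begin{aligned}
C &= -\bigl[\, y \ln a + (1-y)\ln(1-a) \,\bigr],
  \qquad a = \sigma(z), \quad z = \sum_j w_j x_j + b, \\
\frac{\partial C}{\partial w_j}
  &= \left( -\frac{y}{a} + \frac{1-y}{1-a} \right) \sigma'(z)\, x_j
   = \frac{a-y}{a(1-a)}\, \sigma'(z)\, x_j
   = (a-y)\, x_j ,
  \qquad \text{since } \sigma'(z) = a(1-a).
\end{aligned}
```

The factor σ'(z), which tends to zero when the neuron saturates, cancels out of the gradient. With a squared-error loss the same gradient would be (a − y) σ'(z) x_j, so learning would slow down whenever the output neuron saturates.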
The layer relevance weights are learned after training each batch using the update equation defined in Equation (3) subject to Equations (4) and (5). They represent the importance of each layer. In other words, they give insight into how much the feature map learned at a specific layer contributes to conserving the inner product between the input and the output. As such, they are used to prune irrelevant layers by discarding the layers exhibiting low relevance weights.
Figure 8 and Figure 9 display the layer relevance weights learned by DDSAE on MNIST [35] and CIFAR-10 [36] in the first and last batch, respectively.
As can be observed, the maximum layer relevance weight in the first batch is assigned to the first layer of the encoder. Accordingly, starting from the last layer of the encoder, layers with a relevance weight smaller than 75% of the maximum weight are pruned together with their corresponding decoder layers. As a result, on MNIST [35] the number of trainable parameters dropped from 41,609,784 to 3,571,784. Similarly, the trainable parameters on CIFAR-10 [36] decreased from 46,188,072 to 8,150,072 after the algorithm converged. This yields a lightweight model that reduces the risk of overfitting and enhances the model’s generalization. Moreover, by learning a suitable relevance weight for each layer, the algorithm is able to determine the optimal depth of the model.
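The pruning rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the relevance weights are taken as given (the update in Equations (3) to (5) is not reproduced), weights are assumed non-negative, and "starting from the last layer" is interpreted as removing only trailing layers until one meets the threshold.

```python
def learned_depth(relevance, ratio=0.75):
    """Return the number of encoder layers kept after pruning.

    Walks backward from the deepest encoder layer and drops every trailing
    layer whose relevance weight is below `ratio` times the maximum weight.
    The walk stops at the first layer that meets the threshold, and at
    least one layer is always kept.
    """
    if not relevance:
        raise ValueError("relevance must contain at least one weight")
    threshold = ratio * max(relevance)
    depth = len(relevance)
    while depth > 1 and relevance[depth - 1] < threshold:
        depth -= 1
    return depth

# Hypothetical relevance weights for a 5-layer encoder (not values from the paper).
weights = [0.40, 0.25, 0.15, 0.12, 0.08]
print(learned_depth(weights))
```

With these illustrative weights the threshold is 0.30, only the first layer meets it, and the sketch keeps a single layer. In a full model, each removed encoder layer would be dropped together with its mirrored decoder layer.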
To measure the performance of the algorithm, various classifiers were appended on top of the encoder part. These classifiers are a SoftMax layer, a Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) with the number of neighbors set to five.
Table 4 reports the classification accuracy on the MNIST [35] and CIFAR-10 [36] datasets with respect to the considered classifiers. On MNIST [35], the three classifiers achieved similar performance. On CIFAR-10 [36], however, the SoftMax classifier performed better than SVM and KNN. Moreover, the classification accuracy achieved by DDSAE is higher than that of the manual depth investigation experiment. Specifically, the accuracy achieved by DDSAE on MNIST [35] is 97%, compared with 87.20% attained by manual investigation. Similarly, manual investigation reached 25.84% on CIFAR-10, while DDSAE reached an accuracy of 38%. Hence, while learning the optimal number of layers, DDSAE also improved the performance of the SAE.
5.5. Experiment 3: Performance Assessment of the Proposed Approach DDSAE on Real Datasets
To assess the performance of the proposed approach DDSAE in dynamically learning the depth of an SAE, two real datasets, Parkinson’s [37,38] and Breast Cancer [39,40], are considered. As in Experiment 2, the initial number of AEs was set to 20. DDSAE converges to one AE layer for both datasets. These results are in accordance with the findings of Experiment 2.
Figure 10 shows the depth learned dynamically (the retained number of layers) while training each batch on the Parkinson’s [37,38] and Breast Cancer [39,40] datasets. As can be seen, the depth decreased from 20 AEs (after training the first batch) to one AE (after training the second batch) for both datasets.
During the training phase, the layer relevance weights are estimated in an unsupervised way at the end of each batch using the derived equation defined in Equation (3) subject to Equations (4) and (5). Giving insight into the importance of each layer, they are employed to dynamically prune irrelevant layers exhibiting low layer relevance weight.
Figure 11 and Figure 12 depict the layer relevance weights learned by DDSAE on the Parkinson’s [37,38] and Breast Cancer [39,40] datasets in the first and last batch, respectively. As can be seen, after training the model on one batch, the maximum layer relevance weight is assigned to the first layer of the encoder. Hence, all layers exhibiting a relevance weight smaller than 75% of the maximum weight are pruned. In terms of learnable parameters, the number of parameters on Parkinson’s [37,38] dropped from 41,547,753 to 3,509,753 after the model converged to a one-AE network. Likewise, the number of parameters on Breast Cancer [39,40] decreased after convergence from 40,059,009 to 2,021,009.
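The parameter counts reported for the four datasets can be checked with simple arithmetic. This section does not state the layer widths, so the sketch below rests on an assumption: fully connected layers with 1000 hidden units, an outer AE that maps the input to 1000 units and back, and inner AEs made of two 1000 × 1000 layers each. The input sizes 753 and 9 for the two real datasets are likewise assumed (the attribute count minus one column). This configuration was chosen because it reproduces the reported totals exactly, so it should be read as a consistency check rather than as a description of the authors' architecture.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def sae_params(input_dim, n_hidden_aes, width=1000):
    """Parameter count of the assumed architecture.

    Assumption: an outer AE (input_dim -> width -> input_dim) wraps
    `n_hidden_aes` inner AEs, each made of two width x width dense layers.
    """
    outer = dense_params(input_dim, width) + dense_params(width, input_dim)
    inner = 2 * dense_params(width, width)
    return outer + n_hidden_aes * inner

# Assumed input sizes: 28*28 and 32*32*3 pixels; 753 and 9 features.
input_dims = {"MNIST": 784, "CIFAR-10": 3072, "Parkinson's": 753, "Breast Cancer": 9}
for name, dim in input_dims.items():
    # Count before pruning (20 inner AEs) and after convergence (1 inner AE).
    print(name, sae_params(dim, 20), sae_params(dim, 1))
```

Under this assumption, every pruned AE removes 2,002,000 parameters, so pruning 19 of them accounts for the reduction of 38,038,000 parameters that is common to all four datasets.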
Moreover, the performance of DDSAE is assessed by appending different classifiers on top of the encoder part after training. Since the datasets are imbalanced, precision, recall, and F1-score are reported in addition to the classification accuracy.
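The following minimal sketch shows how these four measures are computed for a binary problem and why accuracy alone is insufficient on imbalanced data. The function name and the toy labels are hypothetical, and the choice of which class counts as positive, as well as the convention of returning 0 when a ratio is undefined, are assumptions, since the averaging scheme is not specified here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Undefined ratios (zero denominator) are reported as 0.0 by convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = patient (majority), 0 = healthy (minority).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # a classifier that always predicts "patient"
print(binary_metrics(y_true, y_pred, positive=0))
```

On this toy example the accuracy is 0.75 although the minority class is never detected, which the zero precision, recall, and F1-score make visible.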
Table 5 reports the performance attained on the Parkinson’s [37,38] and Breast Cancer [39,40] datasets using different classifiers. As can be observed, the KNN and SVM classifiers achieved similar results on Parkinson’s, whereas the SoftMax classifier achieved lower results. The same behavior was observed on the Breast Cancer dataset. Specifically, DDSAE reached an F1-score of 77% on the Parkinson’s [37,38] dataset and an F1-score of 89% on the Breast Cancer [39,40] dataset when using the KNN classifier. Therefore, in addition to learning the optimal depth of the SAE, DDSAE is able to yield good classification results.
5.6. Experiment 4: Performance Comparison of DDSAE with the State of the Art
In this experiment, we compare the performance of the DDSAE algorithm with the state-of-the-art approaches that learn the depth of an SAE in an unsupervised manner. The considered approaches are the related works published in [26,27].
The work in [26], referred to as “Unsupervised restricted depth and width learning for a multi-layer AE”, learns the width and depth of a VAE using evolutionary search techniques. It is designed in such a way that the depth cannot exceed five layers and the width varies from 50 to 1000 nodes in steps of 50. The depth or the width is optimized using the mutation operator, and the fitness is defined as the inverse of the VAE reconstruction error. Similarly, the work in [27], called “Unsupervised width and depth learning using a chromosome of fixed length”, is a genetic model that optimizes the AE width and depth. Specifically, it applies crossover and mutation genetic operators to a 14-gene chromosome to create new candidate solutions. The genes of the chromosome encode the number of layers (depth), the number of nodes per layer (width), the activation function, and the loss function. In the following, we refer to the works in [26,27] as “Approach1” and “Approach2”, respectively.
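The search space described for Approach1 can be illustrated with a short sketch. Only the bounds (at most five layers, widths from 50 to 1000 in steps of 50) and the definition of the fitness as the inverse of the reconstruction error come from the description above. The three mutation moves (add, remove, or resize a layer) and all names are hypothetical choices made for illustration, not the operator used in [26].

```python
import random

MAX_DEPTH = 5                        # depth bound stated for Approach1
WIDTHS = list(range(50, 1001, 50))   # 50, 100, ..., 1000

def fitness(reconstruction_error):
    """Fitness defined as the inverse of the reconstruction error."""
    return 1.0 / reconstruction_error

def mutate(layers, rng):
    """Return a mutated copy of `layers` (one width per hidden layer).

    Assumed moves: append a layer, drop the last layer, or resize one
    layer. A move that would violate the bounds falls back to a resize.
    """
    child = list(layers)
    move = rng.choice(["add", "remove", "resize"])
    if move == "add" and len(child) < MAX_DEPTH:
        child.append(rng.choice(WIDTHS))
    elif move == "remove" and len(child) > 1:
        child.pop()
    else:
        index = rng.randrange(len(child))
        child[index] = rng.choice(WIDTHS)
    return child

rng = random.Random(0)
candidate = [500, 200]
for _ in range(3):
    candidate = mutate(candidate, rng)
print(candidate)
```

Every candidate produced this way stays inside the stated bounds, so each one has to be trained and scored separately. DDSAE, by contrast, prunes layers within a single training run.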
The four datasets previously considered (MNIST [35], CIFAR-10 [36], Parkinson’s [37,38], and Breast Cancer [39,40]) are used to assess the performance of the DDSAE approach and compare it to the state-of-the-art approaches. Moreover, the SoftMax classifier was appended on top of the encoder part after training the AE.
Table 6, Table 7, Table 8 and Table 9 report the performance of DDSAE and compare it with the two state-of-the-art approaches on the four datasets, along with the learned depth.
As can be seen in Table 6, on the MNIST dataset [35] DDSAE achieved 97% with respect to the considered performance measures. Hence, it outperformed the first approach [26] by 21% and attained a better classification accuracy than the second approach [27]. All three approaches converged to an architecture of one AE. Moreover, the results on the CIFAR-10 dataset [36], reported in Table 7, show that the accuracy of DDSAE reached 38%. This is better than the first approach [26] by 17% and better than the second approach [27] by 6%. The depth obtained by the three considered approaches is one. This accuracy-based analysis also applies to the other performance indicators reported in Table 6 and Table 7, namely, F1-score, recall, and precision. The accuracy, F1-score, precision, and recall achieved on MNIST [35] and CIFAR-10 [36] are displayed in Figure 13 and Figure 14, respectively.
Furthermore, the performance of DDSAE on the Parkinson’s [37,38] and Breast Cancer [39,40] datasets is reported in Table 8 and Table 9, respectively. The F1-score attained by DDSAE on the Parkinson’s dataset is 48%. This result is slightly higher than the F1-score attained by the first approach [26], and higher than that of the second approach [27] by 5%. Moreover, DDSAE converged to a single-layer model, whereas the first approach [26] reached an optimal depth of five AE layers and the second approach [27] reached two AE layers.
Finally, the performance on the Breast Cancer dataset [39,40] in terms of F1-score reached 68.1% when applying the DDSAE approach. This result is better than those of the two approaches in [26] and [27] by 27% and 33%, respectively. Overall, these results indicate that the DDSAE algorithm outperformed the state-of-the-art approaches that learn the depth of an SAE in an unsupervised manner, in terms of classification accuracy on the balanced datasets and in terms of F1-score on the imbalanced datasets. Moreover, DDSAE yielded a shallower architecture than the state-of-the-art approaches on the imbalanced datasets. The accuracy, F1-score, precision, and recall achieved on Parkinson’s and Breast Cancer are displayed in Figure 15 and Figure 16, respectively.
Real-world data generally exhibit overlapping and nonlinearly separable inter-class boundaries. This constitutes a challenge to machine learning models and drastically affects their performance. A nonlinear mapping can alleviate the problem. In particular, stacked AutoEncoders address it by automatically learning a nonlinear representation of the data. Specifically, they are trained in an unsupervised way to learn a meaningful representation of the data by minimizing the reconstruction error. However, their performance is drastically affected by their depth, which impedes their broad usage. As such, learning the depth of the network while training it facilitates the use of AutoEncoders on real datasets.
Another significant comparison measure is the testing time of DDSAE relative to the two other approaches. Table 10 reports the testing time of all approaches on the four datasets. As can be observed, DDSAE is faster than the first and second approaches. This is because the DDSAE algorithm learned an architecture with fewer layers than the other approaches. As such, the depth plays an important role in the inference time.
However, in terms of training time, as reported in Table 11, the DDSAE approach takes significantly longer than the state-of-the-art approaches. This is because the proposed approach starts with an over-estimated number of layers and subsequently prunes non-relevant layers while training the model, even though the learned depth on the considered datasets converges to a single layer. Nonetheless, since the training is done offline, this cost does not affect real-time applications that employ the previously trained model.