The research involved the application of various machine learning (ML) algorithms to construct classification models, and a series of tests was conducted to evaluate the performance of each model. Throughout the training phase, the accuracy and loss were recorded for each epoch and plotted. After assessing the models with the test dataset, the first output was a confusion matrix relating the predicted labels to the true labels. Two tables containing the essential metrics were then generated: one presented the macro and weighted averages of the metrics precision, recall, F1 score, and support, while the other presented the same metrics per class label (0 or 1), mirroring the structure of the first.
From here, two more plots were obtained: the ROC (receiver operating characteristic) curve and the precision–recall curve. Both plots resulted from varying the classification threshold, showing how the true positive rate (TPR) varied with the false positive rate (FPR) for the ROC curve and how the precision varied with the recall for the precision–recall curve.
The original classification threshold was set to 0.5 to distinguish between the positive and negative classes, and the confusion matrix was derived from it. For each tested model, a table presenting the accuracy, loss, AUC (area under the curve), and cross-validation results was created.
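As a point of reference, the following minimal sketch (an assumed scikit-learn/NumPy workflow with placeholder scores, not the authors' code) shows how the default 0.5 threshold turns predicted probabilities into class labels and a confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder values standing in for the true labels and the model's predicted
# probability of the positive class (Class 1)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.7, 0.8, 0.4, 0.2, 0.9])

y_pred = (y_score >= 0.5).astype(int)    # default classification threshold of 0.5
print(confusion_matrix(y_true, y_pred))  # rows: true labels, columns: predicted labels
```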
3.4.1. GoogLeNet
The GoogLeNet Inception V1 deep neural network, composed of five main blocks, was used for the study [
28]. For all tests, the padding, strides, and activation function were kept at the following values: same padding, strides of (2,2), and the rectified linear unit (ReLU) as the non-linear activation function; any exception to this is highlighted.
The first block had three 2D convolution layers (2DConvs), where the first layer had 64 filters with a dimension of 7 × 7 pixels, followed by a Max pooling layer with a 3 × 3 kernel [
29]. After this layer, two more 2DConvs followed, with 64 and 192 filters and kernel dimensions of 1 × 1 and 3 × 3, respectively. This block ended with a 2D Max pooling layer, with the same parameters as before, which reduced the spatial dimensions and increased the number of channels, preparing the extracted feature maps for the inception blocks.
The first pair of inception modules used the following filter arrays, applied sequentially: (64, 96, 128, 16, 32, 32) and (128, 128, 192, 32, 96, 64), followed by a 2D Max pooling layer with a 3 × 3 kernel size.
The next block was composed of five inception modules with the following parameters for the number of filters: (192, 96, 208, 16, 48, 64), (160, 112, 224, 24, 64, 64), (128, 128, 256, 24, 64, 64), (112, 144, 288, 32, 64, 64), and (256, 160, 320, 32, 128, 128). At the end of this block, there was a 2D average pooling layer with a 3 × 3 kernel size. This block extracted a wide variety of features, which could be described as mid-level features.
The final block had a flatten layer to convert the 2D tensor into a 1D array, followed by a dropout layer with a rate of 0.4 and a dense layer with a sigmoid activation and two neurons to perform the binary classification task. In this case, the images in the datasets were not pre-processed, i.e., the original images were used.
The inception module was made of one 1 × 1 2DConv, two 3 × 3 2DConvs, two 5 × 5 2DConvs, one 3 × 3 2D Max pooling layer, and one 1 × 1 2DConv, and the outputs of its parallel branches were then concatenated. Furthermore, for this neural network to yield relevant results in terms of learning the image patterns and generalizing to unseen cases, it needed to be regularized using the L1 regularizer with a value of 0.000001 applied to every trainable layer. In terms of training hyperparameters, the learning rate was set to 0.0001, with 20 training epochs and a batch size of 32. The accuracy for each epoch is presented in
Figure 3a for both training and validation datasets.
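To make the module structure concrete, the following is a hypothetical Keras-style sketch of an Inception V1 module with the L1 regularization described above; it assumes the standard four-branch layout (1 × 1; 1 × 1 then 3 × 3; 1 × 1 then 5 × 5; 3 × 3 max pooling then 1 × 1) and an illustrative input shape, not the authors' exact implementation:

```python
from tensorflow.keras import layers, regularizers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj, l1=1e-6):
    reg = regularizers.l1(l1)
    # 1 x 1 branch
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu", kernel_regularizer=reg)(x)
    # 1 x 1 reduction followed by a 3 x 3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu", kernel_regularizer=reg)(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu", kernel_regularizer=reg)(b2)
    # 1 x 1 reduction followed by a 5 x 5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu", kernel_regularizer=reg)(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu", kernel_regularizer=reg)(b3)
    # 3 x 3 max pooling followed by a 1 x 1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu", kernel_regularizer=reg)(b4)
    # Concatenate the four branch outputs along the channel axis
    return layers.Concatenate()([b1, b2, b3, b4])

# Example use with the first filter array listed above (feature-map shape assumed)
inputs = layers.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
```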
The loss determined in each epoch is also plotted in
Figure 3b, for both the training and validation datasets. The analysis of the training process shows that, for the training dataset, the accuracy remained between 0.6 and 0.7 until the 15th epoch and then fluctuated, reaching a final value of almost 0.8. For the validation dataset, the accuracy registered a global minimum at the 17th epoch and recovered to a final value of 0.6 at the last epoch. It is possible to see on the loss graph, in
Figure 3b, that the loss decreased for the training dataset and increased for the validation dataset, especially across the last epochs. These plots indicate some overfitting: the accuracy increased for the training dataset while decreasing for the validation dataset, and the overfitting is also evident in the rising validation loss. The resulting confusion matrix obtained for the GoogLeNet model is presented in Figure 4.
These values lead to the next table, with the metrics (precision, recall, F1 score, and support) shown for Classes 0 and 1, corresponding to negative and positive for cancer, respectively.
In
Table 3, the same metrics (precision, recall, F1 score, and support) are calculated with the macro and weighted average, as explained in the results section. The calculated metrics presented in
Table 3, for Class 0, show a high recall, meaning that the number of true negatives classified was high (a consequence of a low number of false positives), and a moderately high precision, indicating that a considerable number of false negatives were still present among the predicted negatives. For Class 1, the precision was high, indicating that most of the images predicted for that class were true positives, while the recall was lower, indicating a high number of false negatives. The precision and recall for both classes were consistent with each other and with the information displayed in the confusion matrix.
The F1 score of each class is the harmonic mean of the previous two metrics. For Class 0, this metric indicates that the model’s performance in predicting cases from that class was reasonably effective. In contrast, for Class 1, the F1 score shows that the balance between precision and recall was worse, caused by the lower recall value. Overall, this means that the model performed better in classifying images from Class 0 than from Class 1.
Also in
Table 3, the macro and weighted average values were calculated from the previous metrics. The macro average did not take the number of instances per class into account, whereas the weighted average weighted each class by its support (number of instances).
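The following sketch (an illustration of the assumed scikit-learn workflow with placeholder labels, not the authors' code) shows how the per-class, macro, and weighted metrics reported in these tables are typically computed:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1]  # placeholder true labels
y_pred = [0, 0, 1, 1, 1, 0]  # placeholder predicted labels

# Per-class precision, recall, F1 score, and support
print(classification_report(y_true, y_pred, digits=3))

# Macro average: unweighted mean over the classes (ignores class imbalance)
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))

# Weighted average: mean over the classes weighted by support (number of instances)
print(precision_recall_fscore_support(y_true, y_pred, average="weighted"))
```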
In
Figure 5a, it is possible to observe the plotted ROC curve, where the AUC (area under the curve) has a value of 0.88. The dashed blue line indicates where a random classifier would fall. In the ROC curve, when the classification threshold is high, the number of false positives is low, as shown by the initially low false positive rate (FPR). The true positive rate (TPR) starts at a value below 0.2, meaning that few true positive cases are identified. As the classification threshold decreases, the FPR increases and the TPR rises with it, so the number of true positives increases. Considering these observations, it was possible to conclude that this model could be more certain in classifying the positive instances. Towards the end of the graph, the curve has an increasing tendency, plateauing at certain FPR values, which suggests that varying the threshold over those ranges does not change the number of true positives; thus, there are certain intervals of the threshold in which no additional instances are classified as positive. At the end, the number of true positives is high, but the number of false positives is high as well, indicating that the classification threshold has reached a low value. Therefore, most of the cases are classified as positives, which explains the high number of both false and true positives.
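For reference, the ROC curve and AUC are typically obtained by sweeping the threshold over the predicted scores, as in the following sketch (an assumed scikit-learn/matplotlib workflow with placeholder data, not the authors' code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc_value = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"ROC (AUC = {auc_value:.2f})")
plt.plot([0, 1], [0, 1], "b--", label="Random classifier")  # dashed reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```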
In
Figure 5b, the precision–recall curve is plotted. On this curve, the classification threshold starts at a high value and decreases along the x-axis. For the initial recall values, there is a threshold at which the precision drops below 0.8 for a recall close to 0.2. This means that, even at a relatively high threshold, the number of false positives increases (reflected in the precision) while the number of false negatives remains high (reflected in the recall). This is not the usual behavior, because at high thresholds only positive instances with predicted values close to 1, and essentially no negative instances, are expected to be classified as positive. However, this is an isolated phenomenon, since the precision rises again after that point.
After that, as the classification threshold decreases and the recall increases (owing to the decrease in the number of false negatives), the precision drops again at a recall of approximately 0.7 and continues to fall until the recall reaches 1. This behavior shows that the number of false positives increases (reflected in the precision) as the number of false negatives decreases (reflected in the recall). In this case, the classification threshold has a lower value, so some negative cases with a predicted value between the current classification threshold and 0.5 are classified as positives, because the predicted value is higher than the boundary (classification threshold) between the two classes. This is the reason why some negative instances become false positives.
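The companion precision–recall curve can be computed in the same way, as in this sketch (again an assumed scikit-learn/matplotlib workflow with placeholder data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Placeholder labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# The classification threshold decreases as the recall increases along the x-axis
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```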
3.4.2. VGG Net
In this case, the VGG net better suited to this specific classification task was VGG-11 [
30]. Throughout the tests, this VGG net was the one that performed best in terms of test accuracy and loss. In this neural network (NN), the parameters of the dense layers needed to be replaced with values more convenient for this case, because the original values for those layers were intended for more output classes and more complex tasks. In practice, the original 4096 neurons in those layers were replaced with 64 and 128 neurons, respectively, and the last layer was adjusted to output a prediction over only two neurons, i.e., the output classes. ReLU was the activation function for most of the layers, together with same padding.
Considering the five blocks preceding the fully connected layers, the first was composed of a 2DConv layer with 64 filters with a dimension of 3 × 3 and a 2D Max pooling layer with a 2 × 2 kernel size. The second had a 2DConv layer with 128 filters, with the same kernel dimensions as described before, and a 2D Max pooling layer with the same parameters as the previous one. The third block had two 2DConv layers, both with 256 filters, and a 2D Max pooling layer; the kernel dimensions were the same for every 2DConv (3 × 3) and for every pooling layer (2 × 2). The fourth block had two 2DConvs with 512 filters, with a Max pooling layer after those layers. The last block before the fully connected layers was the same as the fourth block.
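A hypothetical Keras-style sketch of these five convolutional blocks is given below; the input shape is an assumption, not stated in the text:

```python
from tensorflow.keras import layers, models

def vgg11_features(input_shape=(224, 224, 3)):  # input size assumed
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Five blocks: 64 | 128 | 256, 256 | 512, 512 | 512, 512 filters,
    # each followed by 2 x 2 max pooling
    for block_filters in ([64], [128], [256, 256], [512, 512], [512, 512]):
        for f in block_filters:
            x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return models.Model(inputs, x)
```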
In the fully connected layers, a flatten layer preceded the two dense layers described at the beginning of this section. After each dense layer, there was a dropout layer with a rate of 0.5. The output layer had two neurons and a sigmoid activation function to perform the binary classification task. In this case, the L1 regularization technique was used with a value of 0.00001, applied to every trainable layer. The objective of using this technique was to make the network ignore characteristics with less meaning for the classification, thus preventing overfitting [
31].
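The modified classifier head can be sketched as follows (a hypothetical Keras-style illustration under the stated assumptions; the stand-in feature extractor and input shape are not from the original work):

```python
from tensorflow.keras import layers, models, regularizers

reg = regularizers.l1(1e-5)  # L1 value stated in the text

def vgg11_classifier_head(feature_maps):
    x = layers.Flatten()(feature_maps)
    x = layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)   # replaces 4096
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu", kernel_regularizer=reg)(x)  # replaces 4096
    x = layers.Dropout(0.5)(x)
    return layers.Dense(2, activation="sigmoid", kernel_regularizer=reg)(x)

# Example of attaching the head to a (stand-in) feature extractor
inputs = layers.Input(shape=(224, 224, 3))
features = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
model = models.Model(inputs, vgg11_classifier_head(features))
```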
The plot depicting the delta loss across the epochs was calculated as the difference between the total loss and the L1 loss. The images used in the datasets were the original ones, without any pre-processing.
Figure 6a presents the variation in the accuracy concerning the epochs during the training phase, for the training and validation datasets.
On the other hand,
Figure 6b represents the delta loss variation as a function of the epochs during the training phase for the training and validation datasets. For the VGG-11 net, the training graph shows that the accuracy and loss curves for the training dataset followed the same tendency as the corresponding curves for the validation dataset. For this reason, this network generalized well when classifying unseen cases.
Figure 7 presents the corresponding confusion matrix obtained using the VGG-11 net, with Table 4 presenting the metrics (precision, recall, F1 score, and support) for both classes, 0 and 1 (negative and positive for cancer, respectively).
For Class 0, a high recall (a high number of true negatives classified, as a consequence of a low number of false positives) and a high precision were found, although the cases predicted as negative still included false negatives. For Class 1, the precision was high, indicating that most of the images predicted for that class were true positives, with only some false positives; on the other hand, the recall was lower, indicating a high number of false negatives. The precision and recall for both classes are consistent with each other and with the information displayed in the confusion matrix. The F1 score of each class is the harmonic mean of the previous two metrics. For Class 0, this metric shows that the model’s performance in predicting cases from that class was effective. In contrast, for Class 1, the F1 score shows that the balance between precision and recall was worse, caused by the lower recall. Overall, this means that the model performed better in classifying images from Class 0 than from Class 1.
Table 4 also displays the macro and weighted average values calculated from the previous metrics. The macro average did not take the number of instances per class into account, so the precision, recall, and F1 score for that case were simply the unweighted averages over the classes. In contrast, the weighted average considered the class weight (the number of instances), so the resulting precision, recall, and F1 score accounted for the number of instances in each class. Given that Class 0 had more instances, its impact on the weighted average was larger.
In
Figure 8a, the plotted ROC curve has an AUC value of 0.82. The curve is similar to the one described for GoogLeNet: it has a low TPR at a low FPR, indicating a low number of true positive cases identified at a high classification threshold. When the classification threshold decreases, more true positives are naturally identified, which is reflected in the increase in the TPR from close to 0.2 to close to 0.7. From there on, the TPR does not increase much more, while the number of false positives identified keeps increasing until the end.
In
Figure 8b, the precision–recall curve is plotted. Looking at the graph, there is a certain threshold at which the precision drops below 0.6 for a recall value close to 0.1. This means that the number of false positives increased (reflected in the precision) at a relatively high threshold, together with a high number of false negatives (reflected in the recall).
After that, as the classification threshold decreased and the recall increased (owing to the decrease in the number of false negatives), the precision dropped again at a recall of approximately 0.8 and continued to fall until the recall reached 1. This behavior shows that the number of false positives increased, as seen in the precision values, while the number of false negatives decreased, as seen in the recall.
3.4.4. First Optimized Convolutional Neural Network (CNN1)
This convolutional neural network (CNN) was optimized using a random search algorithm [
33,
34]. To perform this optimization, the scikit-learn library was used. Before the optimization process, an interval of possible values was defined for a chosen set of hyperparameters. During the optimization process, random values from these intervals were tested. The objective of this optimization was to find the set of hyperparameters that maximized the test accuracy. These hyperparameters referred both to parameters of the layers of the CNN architecture (for example, the number of convolutional filters, the size of those filters, the optimization function, etc.) and to other training parameters (batch size, training epochs, etc.).
The architecture was composed of four sets of 2DConv and 2D Max pooling layers, followed by a set of fully connected layers comprising a flatten layer, two dense layers, a dropout layer with a rate of 0.5, and an output layer. Some parameters were kept fixed for those layers: the 2DConv layers used valid padding, a stride of 1, and the ReLU activation function, while the pooling layers had a dimension of 2 × 2.
The set of hyperparameter intervals was the following:
‘optimizer’: ‘SGD’, ‘Adam’, ‘Adagrad’, ‘RMSprop’.
‘1st layer number of filters’: 4, 8.
‘2nd layer number of filters’: 16, 32.
‘3rd layer number of filters’: 64, 128.
‘4th layer number of filters’: 512, 1024.
‘1st dense layer number of neurons’: 32, 64, 128, 256, 512.
‘2nd dense layer number of neurons’: 32, 64, 128, 256, 512.
‘2DConv layer kernel size’: 3, 5, 7.
‘epochs’: 10, 15, 20.
‘batch_size’: 32.
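A minimal sketch of this kind of random sampling over the listed search space is shown below; it uses scikit-learn's ParameterSampler as an assumed mechanism and a hypothetical build/evaluate step, and is not the authors' exact code:

```python
from sklearn.model_selection import ParameterSampler

param_space = {
    "optimizer": ["SGD", "Adam", "Adagrad", "RMSprop"],
    "filters_1": [4, 8],
    "filters_2": [16, 32],
    "filters_3": [64, 128],
    "filters_4": [512, 1024],
    "dense_1": [32, 64, 128, 256, 512],
    "dense_2": [32, 64, 128, 256, 512],
    "kernel_size": [3, 5, 7],
    "epochs": [10, 15, 20],
    "batch_size": [32],
}

# Draw, e.g., 20 random configurations; each one would be used to build, train,
# and evaluate a CNN, keeping the configuration with the highest test accuracy.
for params in ParameterSampler(param_space, n_iter=20, random_state=0):
    print(params)  # placeholder for building, fitting, and evaluating the model
```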
After the optimization process, the hyperparameters that maximized the accuracy were:
‘optimizer’: ‘Adam’.
‘1st layer number of filters’: 8.
‘2nd layer number of filters’: 32.
‘3rd layer number of filters’: 128.
‘4th layer number of filters’: 512.
‘1st dense layer number of neurons’: 32.
‘2nd dense layer number of neurons’: 64.
‘2DConv layer kernel size’: 3.
‘epochs’: 10.
‘batch_size’: 32.
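Putting the optimized values together, a hypothetical Keras-style sketch of CNN1 could look as follows; the input shape and the loss function are assumptions, since they are not specified above:

```python
from tensorflow.keras import layers, models

def build_cnn1(input_shape=(224, 224, 1)):  # input shape assumed
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (8, 32, 128, 512):       # optimized filter counts per block
        x = layers.Conv2D(filters, 3, padding="valid", strides=1, activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)  # optimized dense layer sizes
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(2, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    # Optimizer taken from the search result; loss shown for illustration only
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_cnn1(); model.fit(x_train, y_train, epochs=10, batch_size=32)
```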
In this case, the pre-processing method used was the method previously described in the pre-processing section. In
Figure 10a, it is possible to see the variation in the accuracy across the epochs during the training phase, for the training and validation datasets.
Figure 10b represents the loss variation as a function of the epochs, during the training phase for the training and validation datasets.
According to
Figure 10, for CNN1, both the accuracy and loss for the training and validation datasets followed the same tendency in terms of the numeric values registered for those metrics. During the training phase, this neural network was able to correctly classify the majority of the images in the validation dataset, which it had not seen before. Consequently, this network learned the underlying patterns adequately.
Figure 11 shows the corresponding confusion matrix, explicitly clarified in the results section.
As shown in
Table 5, for Class 0, all metrics (precision, recall, and the F1 score) were above 0.9. The precision tells us that all instances classified as negatives were correctly identified, so they were all true negatives; consequently, there were no false negatives. The recall value means that some false positives were identified by the classification model. For Class 1, the precision value translates into the classification of a small number of false positives, while the recall value shows that no false negatives were identified by the classification model. The precision and recall for both classes are consistent with each other and with the information displayed in the confusion matrix, which shows only three false positives. Analyzing the F1 score, this metric shows that the classification model classified negative cases slightly better than positive ones, because for Class 0 the value was slightly higher than that found for Class 1.
Also in
Table 5, both the macro and weighted averages were calculated from the previous metrics. The macro average did not take the number of instances per class into account, whereas the weighted average considered the class weight (the number of instances). Hence, these average values are consistent with the per-class values in the table and with the confusion matrix displayed in
Figure 11.
In
Figure 12a, the corresponding ROC curve is plotted, showing an AUC value of 0.96, mirroring the data from
Table 5. The plotted ROC curve represents a case where the TPR increases to the unitary value at a low FPR of less than 0.2. This indicates that most of the positive instances received predicted values very close to each other and well separated from those of the negative instances. Implicitly, this model learned to distinguish both classes with a high degree of confidence and accuracy.
In
Figure 12b, the precision–recall curve is shown. In this plot, the precision exhibits some plateaus and some drops. This means that at certain recall values and, therefore, at certain classification thresholds, the number of false positives increased, but by fewer cases than in the other models tested, because the precision drops were smaller than those registered for the other classification models. Besides that, the precision remained higher than 0.8, indicating a good distinction between the two classes. In the end, the precision naturally fell abruptly at a recall of 1, since the number of false positives increased (reflected in the precision) while the number of false negatives decreased (reflected in the recall).
3.4.5. Second Optimized Convolutional Neural Network (CNN2)
In this case, the CNN was optimized using Bayesian optimization with the tree-structured Parzen estimator (TPE) process and denoted as CNN2 [
35]. As was done with the random search for CNN1, a new set of hyperparameters was defined, together with the interval of values for each one. This set included not only the numerical parameters of each layer but also the number of layers (convolutional and dense), giving a broader search space in which to look for the set of hyperparameters that maximized accuracy.
The hyperparameter intervals of values were the following:
‘1st layer number of filters’: 4, 8, 16, 32.
‘2DConv layer kernel size’: 3, 5, 7.
‘Number of Conv and pooling layers (except the input layer)’: 1, 2, 3.
‘2nd layer number of filters’: 8, 16, 32, 64.
‘3rd layer number of filters’: 32, 64, 128.
‘4th layer number of filters’: 128, 256, 512, 1024.
‘Number of fully connected layers’: 1, 2, 3, 4, 5.
‘Number of neurons for the dense layers’: 32, 64, 128.
‘dropout_rate’: 0.2, 0.5.
‘Learning rate’: from 1 × 10−5 to 1 × 10−1.
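A minimal, hypothetical sketch of a TPE-based search over this space is given below; it uses Optuna (the library mentioned later in this section) and a placeholder train_and_evaluate function standing in for building, training, and scoring the CNN:

```python
import optuna

def train_and_evaluate(params):
    # Placeholder for building, training, and scoring the CNN with the given
    # hyperparameters; returns a dummy accuracy so the sketch runs standalone.
    return 0.5

def objective(trial):
    params = {
        "filters_1": trial.suggest_categorical("filters_1", [4, 8, 16, 32]),
        "kernel_size": trial.suggest_categorical("kernel_size", [3, 5, 7]),
        "n_conv_blocks": trial.suggest_int("n_conv_blocks", 1, 3),
        "n_dense_layers": trial.suggest_int("n_dense_layers", 1, 5),
        "dense_neurons": trial.suggest_categorical("dense_neurons", [32, 64, 128]),
        "dropout_rate": trial.suggest_categorical("dropout_rate", [0.2, 0.5]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
    }
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)  # number of trials assumed
print(study.best_params)
```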
After the optimization, the hyperparameters determined to maximize the accuracy were:
‘1st layer number of filters’: 8.
‘2DConv layer kernel size’: 3.
‘Number of Conv and pooling layers (except the input layer)’: 2.
‘2nd layer number of filters’: 8.
‘3rd layer number of filters’: 64.
‘Number of fully connected layers’: 3.
‘Number of neurons for the first dense layer’: 32.
‘Number of neurons for the second dense layer’: 64.
‘Number of neurons for the third dense layer’: 64.
‘dropout_rate’: 0.5.
‘Learning rate’: 0.00233.
The library used to implement this optimization process was Optuna. The images were pre-processed using the second method described in the respective section. In
Figure 13, it is possible to see the variation in the accuracy and the loss across the epochs, during the training phase for the training dataset.
During the CNN2 training phase, the accuracy and loss results for the training dataset indicated that the learning process was successful, as shown by the increase in the accuracy and the corresponding decrease in the loss up to the last epoch. Supporting this claim, the test results for the accuracy and loss corresponded to the accuracy and loss registered in the plot at the last epoch. This means that the network could generalize to unseen cases based on the training dataset images.
Figure 14 shows the corresponding confusion matrix arising from these results, explicitly displaying the classification results.
As evidenced in
Table 6, for Class 0, all metrics (precision, recall, and the F1 score) were above 0.9. The precision tells us that all instances classified as negatives were correctly identified, so they were all true negatives and, consequently, there were no false negatives. The recall value means that some false positives were identified by the classification model. For Class 1, the precision value translates into the classification of a certain number of false positives, while the recall value shows that no false negatives were identified by the classification model. The precision and recall for both classes are consistent with each other and with the information displayed in the confusion matrix (Figure 14), which shows only two false positives. Analyzing the F1 score, this metric shows that the classification model classified negative cases slightly better than positive ones, given that for Class 0 the value was slightly higher than that found for Class 1.
Also in
Table 6, both the macro and weighted averages were calculated from the previous metrics, with all values close to unity.
In
Figure 15a, the plotted ROC curve reached an AUC value of 0.99. Compared with the ROC curve plotted for CNN1, this case shows an even better distinction between the two classes, because of the sharp rise in the TPR at low FPR values.
In
Figure 15b, the precision–recall curve is almost horizontal for recall values between 0 and 0.8. This means that no false positive cases were classified while the recall increased along with the decrease in the classification threshold. For recall values above 0.8, some false positive cases were identified (
Figure 14). This curve profile suggests that the two classes were well distinguished by the classification model.