Figure 1.
Proposed architecture. Starting from an architecture pre-trained on the ImageNet dataset, we replace the last fully connected layer so that the network produces 8 output neurons. The entire network is then re-trained on the ISIC 2019 dataset.
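For illustration, a minimal PyTorch sketch of this adaptation, assuming torchvision's pre-trained weights (this is an illustrative snippet, not the authors' exact code):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # the eight ISIC 2019 diagnostic categories

# ResNet-style backbones expose the final fully connected layer as `fc`.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)  # 2048 -> 8

# EfficientNet backbones expose it as `classifier[1]` (Dropout followed by Linear).
effnet = models.efficientnet_b6(weights=models.EfficientNet_B6_Weights.IMAGENET1K_V1)
effnet.classifier[1] = nn.Linear(effnet.classifier[1].in_features, NUM_CLASSES)  # 2304 -> 8

# Both networks are subsequently re-trained end to end on the ISIC 2019 dataset.
```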
Figure 2.
Training procedure. This methodology is applied to each model within the study. The outputs of this process include a trained model and a latent space visualization of the test set.
Figure 3.
During inference, the proposed methodology generates predictions by averaging the outputs obtained through Test Time Augmentation. Additionally, it provides a visual explanation of the predictions in the form of a saliency map. The regions highlighted in red represent the most significant areas of the image, with importance gradually decreasing toward the blue regions, which indicate the least significant areas. This color-coding scheme is consistent across all images presented in this paper.
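For reference, a hook-based Grad-CAM sketch that produces such a saliency map from the last convolutional block (a generic implementation under assumed names, not necessarily the authors' code; `target_layer` would be, e.g., `model.features[-1]` for an EfficientNet or `model.layer4[-1]` for a ResNet):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """image: (1, 3, H, W) tensor; returns a saliency map in [0, 1] and the class index."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output.detach()

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()       # explain the predicted class
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)        # per-channel weights
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)           # normalize to [0, 1]
        return cam.squeeze().cpu(), class_idx
    finally:
        h1.remove()
        h2.remove()
```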
Figure 4.
Description of the Test Time Augmentation methodology. In the images, rotation is only used for visualization purposes. This is a more detailed version of the TTA branch of Figure 3.
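A minimal sketch of the TTA averaging step at inference time; the eight flip/rotation views below are an assumed example and not necessarily the paper's exact set of 16 augmented copies:

```python
import torch

@torch.no_grad()
def predict_with_tta(model, image):
    """image: (1, 3, H, W) tensor; returns class probabilities averaged over the views."""
    views = []
    for k in range(4):                                   # 0/90/180/270 degree rotations
        rotated = torch.rot90(image, k, dims=(2, 3))
        views.append(rotated)
        views.append(torch.flip(rotated, dims=(3,)))     # horizontally flipped copy
    batch = torch.cat(views, dim=0)                      # run all views as a single batch
    probs = torch.softmax(model(batch), dim=1)
    return probs.mean(dim=0)                             # average the per-view predictions
```

Batching the views in a single forward pass is one way to keep the TTA runtime well below a 16-fold increase over the baseline (cf. Table 8).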
Figure 5.
ROC curves for all classes using EfficientNet-B6 without (left) and with (right) Test Time Augmentation. (a) Without Test Time Augmentation. (b) With Test Time Augmentation.
Figure 6.
Confusion matrices of EfficientNet-B6 without (left) and with (right) Test Time Augmentation. The values are normalized per row. (a) Without Test Time Augmentation. (b) With Test Time Augmentation.
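Per-class ROC curves and row-normalized confusion matrices of this kind can be computed with scikit-learn; a brief sketch in which `y_true`, `y_score`, and `y_pred` are placeholders for the test-set labels, softmax scores, and predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, confusion_matrix

def per_class_roc_and_cm(y_true, y_score, y_pred, num_classes=8):
    """y_true: (N,) int array; y_score: (N, num_classes) softmax scores; y_pred: (N,) int array."""
    curves = {}
    for c in range(num_classes):                              # one-vs-rest ROC per class
        fpr, tpr, _ = roc_curve((np.asarray(y_true) == c).astype(int), y_score[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))
    cm = confusion_matrix(y_true, y_pred, normalize="true")   # normalized per row, as in Figure 6
    return curves, cm
```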
Figure 7.
Images and corresponding Grad-CAM outputs for the best-performing networks. From left to right: ResNeXt-50, ResNet-152, EfficientNet-B4, EfficientNet-B5, EfficientNet-B6, and EfficientNet-B7. Green border indicates the correct classification. (a) An image of actinic keratosis (AK). All neural networks correctly classify the image, despite focusing on different parts of the image. (b) An image of basal cell carcinoma (BCC). All predictions are made by considering the diagnostically significant regions of the image. However, some networks, such as ResNeXt-50 and ResNet-152, are also influenced by less relevant pixels. (c) An image of benign keratosis (BKL). All neural networks correctly classify the image, although EfficientNet-B6 and EfficientNet-B7 partially disregard certain areas of the lesion. (d) An image of dermatofibroma (DF). The regions of the image that determine the classification vary between different models.
Figure 8.
Images and corresponding Grad-CAM outputs for the best-performing networks. From left to right: ResNeXt-50, ResNet-152, EfficientNet-B4, EfficientNet-B5, EfficientNet-B6, and EfficientNet-B7. A red border indicates misclassification; a green border indicates correct classification. (a) An image of melanoma (MEL). EfficientNet-B7 correctly classifies the image despite focusing on only a few pixels of the lesion. (b) An image of melanocytic nevus (NV). All neural networks correctly identify the lesion, but each network bases its decision on different regions of the image. (c) An image of squamous cell carcinoma (SCC). All EfficientNet models misclassify this image, whereas the ResNet-based CNNs classify it correctly. (d) An image of a vascular lesion (VASC). Although all models base their predictions on the pixels corresponding to the lesion, the larger models, ResNet-152 and EfficientNet-B7, misclassify the image.
Figure 9.
Examples of misclassified images together with their corresponding Grad-CAM heatmaps. From top-left to bottom-right: two nevus cases classified as BKL, a BKL lesion classified as melanoma, and a BCC lesion classified as melanoma.
Figure 10.
Latent spaces of the best-performing networks for each backbone architecture. Images belonging to the same class are positioned closely in the embedding space. Clusters corresponding to diseases with similar visual features are situated nearer to each other compared to clusters of visually distinct diseases. This behavior is consistent across all backbone architectures. (a) Latent space of ResNeXt-50. (b) Latent space of ResNet-152. (c) Latent space of EfficientNet-B4. (d) Latent space of EfficientNet-B5. (e) Latent space of EfficientNet-B6. (f) Latent space of EfficientNet-B7.
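A sketch of how such a latent-space plot can be produced, assuming an EfficientNet-style model whose pooled features feed the final fully connected layer and an existing `test_loader` (for ResNet backbones the head is `model.fc` instead of `model.classifier`):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_latent_space(model, test_loader, device="cpu"):
    # Replace the classification head with an identity so the forward pass
    # returns the embedding (e.g., 2304-dimensional for EfficientNet-B6, see Table 2).
    model.classifier = torch.nn.Identity()
    model.eval().to(device)
    feats, labels = [], []
    for images, targets in test_loader:
        feats.append(model(images.to(device)).cpu())
        labels.append(targets)
    feats = torch.cat(feats).numpy()
    labels = torch.cat(labels).numpy()
    emb = TSNE(n_components=2, init="pca").fit_transform(feats)   # 2-D projection
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.show()
```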
Figure 11.
Qualitative analysis of the model’s performance. The images are organized such that rows represent the ground truth labels, while columns correspond to the predictions made by the EfficientNet-B6 model. In instances where there are no images for a particular ground truth–prediction pair, a grey placeholder image is displayed.
Table 1.
This table outlines the key hyperparameters used uniformly across all backbone models during training. The configuration details are organized into three main sections: training, scheduler, and optimizer, providing a concise but comprehensive summary of the hyperparameter setup.
Scheduler: ReduceLROnPlateau; Optimizer: MADGRAD.

| Batch Size | Loss | Scheduler Factor | Scheduler Patience | Scheduler Min LR | Learning Rate | Weight Decay | Momentum |
|---|---|---|---|---|---|---|---|
| 512 | WCE | 0.1 | 10 | | 0.001 | 0 | 0.9 |
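In PyTorch terms this configuration corresponds roughly to the following setup; the `madgrad` package (facebookresearch implementation) is an assumption, and the weighted cross-entropy class weights are placeholders:

```python
import torch
import torch.nn as nn
from madgrad import MADGRAD            # assumed optimizer implementation: pip install madgrad
from torchvision import models

model = models.efficientnet_b6(num_classes=8)              # adapted backbone (see Figure 1)

class_weights = torch.ones(8)                              # placeholder WCE class weights
criterion = nn.CrossEntropyLoss(weight=class_weights)      # weighted cross-entropy (WCE)
optimizer = MADGRAD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

# After each epoch, the scheduler is stepped on the monitored validation metric:
# scheduler.step(val_loss)
```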
Table 2.
The input image size and the dimensions of the output feature map of the final convolutional layer are specified. Following this convolutional stage, a fully connected layer is appended to produce outputs corresponding to the eight skin disease classes, with each class representing a distinct condition.
| Backbone | Image Size | w × h × c | Input to Fully Connected |
|---|---|---|---|
| ResNeXt-50 | 600 × 600 × 3 | 19 × 19 × 2048 | 2048 |
| ResNet-152 | 600 × 600 × 3 | 19 × 19 × 2048 | 2048 |
| EfficientNet-B4 | 380 × 380 × 3 | 12 × 12 × 1792 | 1792 |
| EfficientNet-B5 | 456 × 456 × 3 | 15 × 15 × 2048 | 2048 |
| EfficientNet-B6 | 528 × 528 × 3 | 17 × 17 × 2304 | 2304 |
| EfficientNet-B7 | 600 × 600 × 3 | 19 × 19 × 2560 | 2560 |
Table 3.
Parameters for geometric transformations used during data augmentation. This table outlines the probabilities and ranges for horizontal and vertical flips, scaling, rotation, and shearing applied to the input data to enhance model generalization and robustness.
| H. Flip Probability | V. Flip Probability | Scale From | Scale To | Rotation From | Rotation To | Shear From | Shear To |
|---|---|---|---|---|---|---|---|
| 0.5 | 0.5 | 0.7 | 1.7 | 0 degrees | 359 degrees | −30 | 30 |
Table 4.
Parameters for color adjustments applied as part of data augmentation. These settings specify the maximum perturbations in brightness, contrast, hue, and saturation, enhancing model robustness to varying lighting and color conditions.
| Brightness | Contrast | Hue | Saturation |
|---|---|---|---|
| 0.2 | 0.2 | 0.05 | 0.05 |
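One possible realization of the augmentation pipeline in Tables 3 and 4 using torchvision (the library actually used by the authors is not stated here; the resize follows the EfficientNet-B6 input size from Table 2):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((528, 528)),                                    # EfficientNet-B6 input size
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=(0, 359), scale=(0.7, 1.7), shear=(-30, 30)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.05, hue=0.05),
    transforms.ToTensor(),
])
```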
Table 5.
Performance metrics across different model backbones with and without TTA. The table compares baseline accuracy with TTA-enhanced accuracy and highlights the number of predictions changed, corrections made, and errors introduced due to TTA. EfficientNet-B6 achieved the highest accuracy with TTA. The row corresponding to the model achieving the best performance is highlighted in bold for emphasis.
| Backbone | Accuracy | TTA Accuracy | Predictions Changed | Corrections | Errors Introduced | Difference |
|---|---|---|---|---|---|---|
| ResNeXt-50 | 97.10% | 97.39% | 82 | 52 | 22 | 30 |
| ResNet-152 | 96.94% | 96.94% | 93 | 40 | 40 | 0 |
| EfficientNet-B4 | 97.12% | 97.27% | 105 | 55 | 39 | 16 |
| EfficientNet-B5 | 97.06% | 97.39% | 106 | 65 | 31 | 34 |
| **EfficientNet-B6** | **97.31%** | **97.58%** | **102** | **59** | **32** | **27** |
| EfficientNet-B7 | 97.04% | 97.30% | 113 | 67 | 40 | 27 |
Table 6.
Performance metrics for skin disease classification using the best model (EfficientNet-B6) without TTA. The table provides precision, recall, F1 score, and accuracy for each disease class, showing strong overall performance. Mean values summarize the model’s effectiveness in predicting diverse skin disease categories under standard conditions without TTA enhancements.
| Skin Disease | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| MEL | 0.86 | 0.78 | 0.82 | 93.80% |
| NV | 0.92 | 0.94 | 0.93 | 93.00% |
| BCC | 0.92 | 0.92 | 0.92 | 97.98% |
| AK | 0.76 | 0.79 | 0.78 | 98.42% |
| BKL | 0.83 | 0.87 | 0.85 | 96.80% |
| DF | 0.73 | 0.79 | 0.76 | 99.52% |
| VASC | 0.77 | 0.96 | 0.86 | 99.68% |
| SCC | 0.91 | 0.79 | 0.85 | 99.28% |
| Mean | 0.84 | 0.85 | 0.85 | 97.31% |
Table 7.
Performance metrics for skin disease classification using the best model (EfficientNet-B6) with TTA. The table reports precision, recall, F1 score, and accuracy for each disease class, highlighting balanced performance across classes. The mean values demonstrate the model’s overall effectiveness in handling a diverse set of skin disease categories.
| Skin Disease | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| MEL | 0.89 | 0.79 | 0.84 | 94.59% |
| NV | 0.93 | 0.95 | 0.94 | 93.68% |
| BCC | 0.93 | 0.93 | 0.93 | 98.14% |
| AK | 0.77 | 0.84 | 0.80 | 98.58% |
| BKL | 0.84 | 0.87 | 0.86 | 97.00% |
| DF | 0.83 | 0.79 | 0.81 | 99.64% |
| VASC | 0.79 | 0.92 | 0.85 | 99.68% |
| SCC | 0.91 | 0.81 | 0.86 | 99.33% |
| Mean | 0.86 | 0.86 | 0.86 | 97.58% |
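Per-class precision, recall, and F1 scores of this kind can be obtained with scikit-learn; in the sketch below the label order and the prediction arrays are placeholders:

```python
import numpy as np
from sklearn.metrics import classification_report

CLASS_NAMES = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC"]  # assumed label order

# Placeholder arrays standing in for the test-set ground truth and the model predictions.
y_true = np.random.randint(0, 8, size=1000)
y_pred = np.random.randint(0, 8, size=1000)

print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=2))
```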
Table 8.
The inference time for each model, with and without TTA, is evaluated using a CPU. Notably, although TTA involves 16 different versions of each input image, the execution time does not scale linearly by a factor of 16. This efficiency is achieved through an optimized method of feeding the images into the model. However, achieving a similar level of efficiency is more challenging when employing ensemble methods.
| Model | No TTA (s) | TTA (s) |
|---|---|---|
| EfficientNet-B4 | 0.66 ± 0.22 | 1.44 ± 0.09 |
| EfficientNet-B5 | 1.11 ± 0.26 | 6.76 ± 0.64 |
| EfficientNet-B6 | 1.66 ± 0.30 | 11.82 ± 0.87 |
| EfficientNet-B7 | 2.46 ± 0.33 | 15.64 ± 0.95 |
| ResNeXt-50 | 0.43 ± 0.17 | 4.68 ± 0.45 |
| ResNet-152 | 0.66 ± 0.22 | 4.02 ± 0.25 |
Table 9.
Performance comparison with current state-of-the-art methods. The method achieving the highest test accuracy for each dataset is highlighted in bold. For the ISIC 2019 dataset, the proposed approach demonstrates either superior or comparable performance relative to other methodologies, including comparisons with more recent architectures such as the ViT.
| Study | Dataset(s) | Methodology | Results | Skin Classes | xAI Method | Train:Val:Test |
|---|---|---|---|---|---|---|
| [35] | HAM10000 | Custom CNN | Acc 82.7% | 7 Classes | CAM | 80%:10%:10% |
| [36] | ISIC 2018 | CNN | ROC-AUC 94% | 7 Classes | Backpropagation | Not Available |
| [56] | ISIC 2017 | CNN | - | 3 Classes | CAM | Not Available |
| [38] | ISIC 2018 | VGG16+ResNet50 | Acc 85% | 7 Classes | Occlusion | 70%:10%:20% |
| [78] | ISIC 2019 | VGG16+ResNet50 | Acc 72.2%, 76.7% | 8 Classes | Grad-CAM | Not Available |
| [79] | 1021 images | ResNet50 | Acc 60.94% | 4 Classes | CBIR | Not Available |
| [10] | ISIC 2017, PH2 | Modified deep CNN | Acc 90.4% | 3 Classes | CAM | 2000:150:600 |
| [57] | ISIC 2017 | ResNet50 | Acc 83% | 2 Classes | CAM | 2000:150:600 |
| [80] | HAM10000 | Inception | Acc 85% | 2 Classes | Grad-CAM, Kernel SHAP | Not Available |
| [58] | ISIC 2016 | VGG16 | ROC-AUC 81.18% | 2 Classes | CAM | 900:NA:379 |
| [12] | ISIC 2018 | CNN | Spec 86.5% | 1 Class | No | 12378:1259:100 |
| [11] | ISIC 2019 | Deep CNN | Acc 94.92% | 8 Classes | No | 80%:10%:10% |
| [41] | ISIC 2018 | ResNet50 | Acc 92% | 7 Classes | No | 70%:NA:30% |
| [42] | PH2 | CNN | Acc 93% | 3 Classes | No | 70%:20%:10% |
| [43] | ISIC 2018 | Inception V3+ResNet50 | Acc 89.9% | 7 Classes | No | 80%:20%:NA |
| [44] | ISIC 2018 | Deep CNN | Acc 96.67% | 2 Classes | No | Variable |
| [45] | PH2, ISBI 2017 | YOLO, GrabCut | Acc 93.39% | 3 Classes | No | 2000:150:600 |
| [46] | ISIC 2018 | Custom CNN | Acc 87.5% | 7 Classes | t-SNE | 10015:193:1512 |
| [47] | HAM10000 | Custom CNN | Acc 91.6% | 7 Classes | No | 8012:NA:2003 |
| [48] | HAM10000 | Custom CNN | Acc 95.73% | 7 Classes | No | 80%:NA:20% |
| [49] | HAM10000 | Custom CNN | Acc 97.96% | 7 Classes | No | 80%:NA:20% |
| [50] | HAM10000 | Custom CNN | Acc 96.12% | 7 Classes | No | 8010:NA:2005 |
| [62] | ISIC 2019 | CNN | Acc 96.12% | 8 Classes | No | 75%:0%:25% |
| [63] | ISIC 2020 | Custom model | Acc 93.75% | 2 Classes | No | 2302:NA:989 |
| [61] | ISIC 2019 | Custom model | Acc 96.22% | 8 Classes | Interpretable model | 10-fold CV |
| [66] | ISIC 2019 | Transformer model | Acc 97.48% | 8 Classes | No | 80%:10%:10% |
| [67] | ISIC 2019 | ViT | Acc 96.97% | 8 Classes | No | Not Available |
| [67] | ISIC 2020 | ViT | Acc 97.73% | 2 Classes | No | Not Available |
| [69] | ISIC 2020 | CNN + Classifier | Acc 88% | 2 Classes | Grad-CAM | 70%:0%:30% |
| Ours | ISIC 2019 | Deep CNN | Acc 97.31% | 8 Classes | Grad-CAM, t-SNE | 80%:10%:10% |
| Ours (TTA) | ISIC 2019 | Deep CNN | Acc 97.58% | 8 Classes | No | 80%:10%:10% |