1.2.1. Deep Learning Approach
In Deep Learning, one of the most cutting-edge techniques is Soft-Attention, as stated in [1]. Soumyyak et al. constructed several models, each combining a Soft-Attention layer with a backbone model: DenseNet201 [4], InceptionResNetV2 [6], ResNet50 [11,12], or VGG16 [27]. Their approach adds the Soft-Attention layer at the end or in the middle of the backbone model. For ResNet50 and VGG16, the Soft-Attention layer is added after the third residual block and CNN block, respectively. For DenseNet201 and InceptionResNetV2, the backbone output is concatenated with the Soft-Attention output before a fully connected layer and a soft-max layer. Soumyyak et al.’s method outperformed many other studies, with an accuracy of 0.93 and a precision of 0.92. However, using data augmentation on an imbalanced dataset resulted in subpar classification for some classes; as a result, their model obtained a recall and F1-score of only 0.71 and 0.75, respectively. In this research, our proposed method also considers this problem and solves it.
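To make the idea of a Soft-Attention layer concrete, the following is a minimal NumPy sketch of a generic spatial soft-attention step, not the exact layer used in [1]. The 1x1 projection `weights`, the number of attention heads, and the scale `gamma` are assumptions introduced for illustration only.

```python
import numpy as np

def soft_attention(features, weights, gamma=1.0):
    """Weight a feature map by a spatial soft-attention map.

    features: array of shape (H, W, C), e.g. the output of a backbone block.
    weights:  array of shape (C, K); a hypothetical 1x1 projection with one
              column per attention head (an assumption, not taken from [1]).
    """
    h, w, c = features.shape
    scores = features.reshape(h * w, c) @ weights        # (H*W, K)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    maps = exp / exp.sum(axis=0, keepdims=True)          # softmax over positions
    attention = maps.sum(axis=1).reshape(h, w)           # combine the K heads
    attended = gamma * features * attention[..., None]   # reweight every channel
    return attended, attention

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 16))   # toy feature map
weights = rng.normal(size=(16, 4))       # toy projection, 4 heads
attended, attention = soft_attention(features, weights)
```

The `attention` array can be viewed as a heat map over spatial positions, and `attended` can be concatenated with the original features before the classification head.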
The above-mentioned backbones have been used previously. Rishu Garg et al. [14] used a transfer learning approach with the CNN-based models ResNet50 and VGG16, pretrained on the ImageNet dataset. They also used data augmentation to reduce the imbalance in the dataset. Histogram equalization was used to increase the contrast of skin lesions before the images were fed into machine learning algorithms including Random Forest, XGBoost, and Support Vector Machine. Histogram equalization can be considered a heat map whose main feature is the number of occurrences of each pixel value. This approach also performed well, with an accuracy of 0.90 and a precision of 0.88. However, it can be biased: each skin image in the dataset contains only the skin lesion at the center and the background skin, so the histogram may be dominated by the background, whose pixel values occur more often. In this research study, our proposed method uses Soft-Attention, which can create a heat map feature of the lesion. In addition, Rishu Garg et al.’s method also faced the problem of imbalanced classification due to an imbalanced dataset, with an F1-score and recall of 0.77 and 0.74, respectively.
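As an illustration of histogram equalization, here is a minimal NumPy sketch for an 8-bit grayscale image. It is a textbook formulation and not necessarily the preprocessing used in [14]. The toy image is an assumption. Because the histogram counts every pixel, background pixels contribute to the mapping as much as lesion pixels do.

```python
import numpy as np

def equalize_histogram(image):
    """Equalize an 8-bit grayscale image given as a 2-D uint8 array."""
    hist = np.bincount(image.ravel(), minlength=256)  # occurrences per value
    cdf = hist.cumsum()                               # cumulative counts
    cdf_min = cdf[cdf > 0][0]                         # count at the darkest value present
    total = image.size
    if total == cdf_min:                              # constant image: nothing to stretch
        return image.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)       # lookup table: old value -> new value
    return lut[image]

rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)  # toy image
equalized = equalize_histogram(low_contrast)
```

After equalization the toy image spans the full 0 to 255 range instead of the narrow band it started with.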
Instead of using the entire imbalanced dataset, Abayomi-alli et al. separated the dataset into two subsets: one containing only melanoma and the other containing the rest [24]. Before the data are used to classify melanoma, the training data are augmented with the SMOTE method. SMOTE creates artificial instances by oversampling the minority class: it identifies the k minority-class neighbors nearest to each minority-class sample by using the covariance matrix. This approach obtained an accuracy, recall, and F1-score of 0.92, 0.87, and 0.82, respectively.
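The oversampling step can be sketched as follows. This is a minimal version of the standard SMOTE formulation, which finds neighbors by Euclidean distance; it does not reproduce the covariance-based neighbor search described for [24]. The function name, parameters, and toy data are illustrative assumptions.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class
    samples and their k nearest minority-class neighbors (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)                                   # requires at least 2 samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)               # pairwise distances
    np.fill_diagonal(dists, np.inf)                     # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]        # (n, k) neighbor indices
    base = rng.integers(0, n, size=n_new)               # sample to start from
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                        # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

rng = np.random.default_rng(1)
minority = rng.normal(loc=2.0, size=(20, 3))   # toy minority-class features
synthetic = smote(minority, n_new=50)
```

Each synthetic point lies on the line segment between two real minority samples, so the new data stay within the region already occupied by the minority class.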
Amirreza et al. [15] used not only the backbone models mentioned above but also the InceptionV3 [6] model. In that study, the HAM10000 and PH2 datasets were combined to create an eight-class dataset. Before being fed into the deep CNN models, the images were resized to (224, 224) for DenseNet201, ResNet152, and InceptionResNetV2, and to (229, 229) for InceptionV3. The best AUC values for melanoma and basal cell carcinoma were 0.94 (ResNet152) and 0.93 (DenseNet201).
Another paper that uses backbone models is [16], in which Hemanth et al. used EfficientNet [28] and SENet [29] together with the CutOut [30] method, which creates holes of different sizes in images, i.e., makes a random portion of the image inactive during the data augmentation process. Although this approach obtained an accuracy of 0.88, it may be biased by the CutOut method, since a hole can overlap the skin lesion region. The method’s accuracy is also low due to the data-augmentation process.
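A minimal sketch of CutOut-style augmentation illustrates the concern about holes overlapping the lesion. The hole size, the zero fill value, and the toy lesion mask are assumptions chosen for illustration and are not taken from [16] or [30].

```python
import numpy as np

def cutout(image, hole_size, rng):
    """Zero out one square region centered at a random pixel.

    Returns the augmented copy and the hole bounds (y0, y1, x0, x1).
    """
    h, w = image.shape[:2]
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    half = hole_size // 2
    y0, y1 = max(cy - half, 0), min(cy + half, h)   # clip the hole to the image
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    out = image.copy()                              # leave the input untouched
    out[y0:y1, x0:x1] = 0
    return out, (y0, y1, x0, x1)

rng = np.random.default_rng(0)
image = np.ones((64, 64, 3), dtype=np.float32)      # toy image
lesion_mask = np.zeros((64, 64), dtype=bool)        # hypothetical lesion location
lesion_mask[24:40, 24:40] = True

augmented, (y0, y1, x0, x1) = cutout(image, hole_size=16, rng=rng)
# Fraction of the lesion hidden by the hole; a large value means the
# augmented sample has lost most of the information needed for its label.
hidden = lesion_mask[y0:y1, x0:x1].sum() / lesion_mask.sum()
```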
Peng Yao et al. [17] also used a deep convolutional neural network, together with RandAugment, which crops an image into several images of a fixed size; DropBlock, which is used for regularization; Multi-Weighted New Loss, which deals with the imbalanced data problem; and an end-to-end Cumulative Learning Strategy, which can more effectively balance representation learning and classifier learning without additional computational costs. This approach obtained an accuracy of 0.86. Although it addressed the data imbalance problem, the low accuracy may be due to RandAugment: if the lesion in an image is very large or very small, a cropped image may contain only skin, or the lesion may be spread over the entire image.
Another state-of-the-art approach uses GradCam and Kernel SHAP [18]. Kyle Young et al. created a model-agnostic approach that includes local interpretability methods, which highlight the pixels that the trained network deems relevant for the final classification. In that study, they used three datasets: HAM10000, BCN-20000, and MSK. Before being fed into the models, images were preprocessed by binarization with a very low threshold to find the center of mass. This approach obtained an AUC of 0.85.
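The binarization step can be sketched as follows. The threshold value and the toy image are assumptions, since the text only states that the threshold is very low.

```python
import numpy as np

def center_of_mass(gray, threshold=10):
    """Binarize a grayscale image and return the (row, col) center of mass
    of the pixels above the threshold."""
    mask = gray > threshold                 # binarization
    if not mask.any():
        raise ValueError("no pixel exceeds the threshold")
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

gray = np.zeros((100, 120), dtype=np.uint8)   # toy image with a dark border
gray[20:60, 30:90] = 200                      # bright content region
cy, cx = center_of_mass(gray)
```

The resulting coordinates can then be used, for example, to center a crop on the image content.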
There are also many other state-of-the-art methods with strong performance on skin lesion classification. The Student-and-Teacher model, introduced in 2021 [19] by Xiaohan Xing et al., combines two models that share memory with each other, so each model can take full advantage of what the other learns. The Student-and-Teacher model obtained an accuracy of 0.85; however, the precision and F1-score are quite low, at 0.76.
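A small worked example shows how accuracy can stay high while precision, recall, and F1-score lag behind on imbalanced data. The confusion-matrix counts below are hypothetical and describe a simplified binary case, not the results of any cited study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test set: 1000 images, only 100 of them in the minority class.
tp, fn = 60, 40     # minority-class images found / missed
fp, tn = 20, 880    # majority-class images misclassified / correctly rejected

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Here the accuracy is dominated by the 880 correctly rejected majority-class images, while recall reflects that 40 of the 100 minority-class images were missed.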
SkinLinkNet [20] and WonderM [21], created by Amirreza et al. and Yeong Chan et al., respectively, both tested the effect of segmentation on skin lesion classification. In WonderM, the image is padded so that its shape increases from (450, 600) to (600, 600). In SkinLinkNet, the image is instead resized down to (448, 448). Both SkinLinkNet and WonderM used UNet for the segmentation task, but they used EfficientNetB0 and DenseNet for the classification task. This approach obtained an AUC of 0.92.
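The padding step from (450, 600) to (600, 600) can be sketched with `numpy.pad`. Symmetric zero padding is an assumption here; the text does not state the fill value or where the padding is placed.

```python
import numpy as np

def pad_to_square(image):
    """Pad the shorter spatial side with zeros so the image becomes square."""
    h, w = image.shape[:2]
    size = max(h, w)
    top = (size - h) // 2
    left = (size - w) // 2
    pad = [(top, size - h - top), (left, size - w - left)]
    pad += [(0, 0)] * (image.ndim - 2)        # do not pad the channel axis
    return np.pad(image, pad, mode="constant", constant_values=0)

image = np.full((450, 600, 3), 128, dtype=np.uint8)   # toy image
padded = pad_to_square(image)
```

Unlike resizing, padding keeps the original aspect ratio of the lesion, at the cost of adding uninformative border pixels.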
Another approach is to use metadata, including gender, age, and capturing position, as stated in [22] by Nils Gessert et al. Metadata are fed into a fully connected neural network after being encoded into a one-hot vector. All missing data points with respect to age are set to 0. To overcome the missing data problem, the study applied one-hot encoding to the group, but initial validation showed poorer performance than when numerical encoding was applied. The metadata are then fed into a network of two blocks, each containing a Dense layer, Batch Normalization, a ReLU activation function, and Dropout. After all the feature vectors were extracted, the image feature vector was concatenated with the feature vector extracted from the metadata. Data augmentation was also applied. This approach obtained a recall of 0.74. The low recall may be due to the imbalanced dataset.
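The metadata encoding and fusion can be sketched as follows. The category vocabularies, the age scaling, and the 512-dimensional image feature vector are hypothetical; the two Dense blocks are omitted, so the encoded metadata are concatenated directly for brevity.

```python
import numpy as np

SEXES = ["female", "male"]                      # hypothetical vocabulary
SITES = ["head/neck", "torso", "upper extremity", "lower extremity"]  # hypothetical

def one_hot(value, vocabulary):
    """One-hot encode a value; unknown or missing values give an all-zero vector."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    if value in vocabulary:
        vec[vocabulary.index(value)] = 1.0
    return vec

def encode_metadata(age, sex, site):
    """Numerical age (missing age set to 0) plus one-hot sex and site."""
    age_value = 0.0 if age is None else age / 100.0   # scaling is an assumption
    parts = [[age_value], one_hot(sex, SEXES), one_hot(site, SITES)]
    return np.concatenate(parts).astype(np.float32)

meta = encode_metadata(age=None, sex="male", site="torso")
image_features = np.zeros(512, dtype=np.float32)      # placeholder CNN features
fused = np.concatenate([image_features, meta])        # input to the classifier head
```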
Abnormal skin lesion segmentation also plays an important role in skin lesion classification. Nawaz et al. created a framework for melanoma segmentation [25]. Their proposed method is a UNet model that uses DenseNet77 as the backbone, with all residual blocks replaced by dense blocks, each containing a sequence of Convolution and Average Pooling layers. This melanoma segmentation approach obtained an accuracy of 0.99. Kadry et al. used a UNet model with a VGG16 deep convolution layer by pooling on the skip connection. This approach can extract the entire lesion, although some overlap with hair was observed. It obtained an accuracy of 0.97.
1.2.2. Machine Learning Approach
In Machine Learning, there are also many approaches. Since image data are quite complex for machine learning algorithms, using feature extractors or feature preprocessing to transform them into another form of data is recommended.
Random Forest, XGBoost, and Support Vector Machines were tested by Rishu Garg et al. [14]. In this approach, the data were fed directly into the machine learning algorithms and showed no promising results; therefore, Rishu Garg et al. did not report the results of these algorithms.
In addition, Deep Isolation Forest is applied before the soft-max activation of the deep learning model to detect the distribution of skin lesion images, as stated in [31] by Amirreza Rezvantalab et al. In the Deep Isolation Forest, a CNN is used as a feature extractor to learn the main pattern of the image. The feature map is then fed into K isolation forest estimators by using bagging algorithms. The Deep Isolation Forest obtained an accuracy of 0.9 and a confidence of 0.86. However, the AUC is only 0.74, which may be due to the limitations of the machine learning algorithm.
Matrix transformation is also applied before the soft-max activation function in [23] by Michele Alberti et al. In this approach, the image is fed into a general model consisting of a sequence of residual blocks. The feature maps created by these residual blocks are then fed into Global Average Pooling to create a feature vector. This feature vector is then processed by a CNN-1D and transformed by the Discrete Fourier Transform (DFT) as a filter before proceeding to the soft-max layer.
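The pipeline from feature maps to a DFT-transformed vector can be sketched as follows. The feature-map shape, the fixed smoothing kernel standing in for a learned CNN-1D filter, and the use of the spectrum magnitude are all assumptions; the exact design in [23] may differ.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over its spatial positions: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(7, 7, 64))       # toy residual-block output

vector = global_average_pool(feature_maps)       # feature vector, shape (64,)
kernel = np.array([0.25, 0.5, 0.25])             # hypothetical 1-D conv filter
filtered = np.convolve(vector, kernel, mode="same")
spectrum = np.abs(np.fft.fft(filtered))          # DFT magnitude, shape (64,)
```

In a full model, `spectrum` (or the complex DFT output) would be passed to a final dense layer followed by soft-max.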