1. Introduction
Primary bone cancers account for less than 0.2% of all cancer cases; they are exceptionally infrequent tumors whose true incidence is difficult to ascertain because of their rarity [
1]. The three most common forms of bone cancer are osteosarcoma, chondrosarcoma, and Ewing sarcoma. Bone cancers are named according to their histological lineage: osteosarcomas arise from bone tissue, chordomas originate from notochordal tissue, and chondrosarcomas emerge from cartilage tissue. Primary bone cancers exhibit significant clinical variability and are frequently curable when given appropriate care. The incidence of bone cancers varies with both sex and age. Chordoma is more prevalent in men, with the highest prevalence in the fifth to sixth decades of life. Chondrosarcoma mainly affects middle-aged and older adults, whereas Ewing sarcoma and osteosarcoma mainly affect children and young people. Once a tumor has spread to the bone, it leads to significant skeletal deformation, fractures, pain, and malnutrition, making it a leading cause of mortality and morbidity. Patients diagnosed with advanced breast, prostate, or lung cancer often experience bone pain because of the notable tendency of these malignancies to metastasize to the skeletal system [
2]. Osteosarcoma ranks eighth among all childhood cancers. It usually starts in the bone cells that form new bone tissue and can develop in any bone in the body. However, it most commonly occurs in the long bones of the legs and arms. The most frequent sites of osteosarcoma are the femur (42%), the tibia (19%), and the humerus (10%) [
3]. Incidence shows a first peak in 10- to 14-year-olds and a second in adults over 65. Roughly 3 people per million are affected by osteosarcoma each year, and the 15 to 19 age group is affected most. In general, the incidence rate in females is lower than that in males [
4].
Symptoms of osteosarcoma can include pain, swelling, stiffness in the affected bone, and difficulty moving the affected limb. A mass or lump may be visible on or near the affected bone. The etiology of osteosarcoma remains uncertain, although certain risk factors have been identified, including a prior history of radiation therapy, the presence of specific genetic disorders such as Li-Fraumeni syndrome, and a previous diagnosis of Paget’s disease. Spinal osteosarcoma is an aggressive form of bone cancer primarily affecting the spine. Compared to osteosarcoma of the extremities, which has a mean age of 38, osteosarcoma of the spine typically affects older age groups [
5]. The danger lies in its ability to grow rapidly and spread (metastasize) to other parts of the body, including the lungs. Because of its location near critical nerves and the spinal cord, it can cause severe pain, neurological deficits, and even paralysis. Osteosarcoma has a significantly higher death rate than other cancers, so early identification is crucial because it may lower the death rate. Crucial diagnostic tools for osteosarcoma include magnetic resonance imaging, X-rays, and histological biopsy. At present, the initial stage of osteosarcoma diagnosis involves taking thorough clinical records and performing physical exams [
6]. Diagnosing osteosarcoma requires a doctor with a high level of knowledge and experience. Distinguishing the subtleties of histological images can be challenging because pathologists must examine many histological slides [
7]. In this context, the use of an automated method for osteosarcoma detection has the potential to alleviate the burdens and obligations faced by pathologists due to the overwhelming volume of cases.
Furthermore, numerous laboratory tests are required due to the rising incidence of cancer, which frequently causes pathologists to become exhausted. Cancer management and diagnostic tests are currently more complicated than ever due to patient-specific treatments [
8]. In recent years, there has been a notable rise in the use of automated analysis techniques for microscopic image examination in cancer detection. This trend has emerged in response to the limitations of conventional methods. Radiologists and pathologists can use computer-aided detection (CAD) technology to find neoplasms quickly based on histopathology image data [
9,
10]. Histological slides are now being converted into digital image datasets, a trend that enables machine learning (ML) to operate on image files and improve diagnostic accuracy. CAD systems that incorporate powerful algorithms, such as deep learning (DL) models, can precisely identify cancerous tumor growth. Researchers have conducted several clinical studies on various illnesses, including osteosarcoma. ML is efficient at processing digital images and can be used to detect and classify osteosarcoma. For osteosarcoma detection, researchers have used ML and DL approaches such as convolutional neural networks (CNNs), Support Vector Machines (SVMs), and several other strategies [
11]. The CNN model with data augmentation was employed by Asmaria et al. [
12] as one of the strategies to enhance model performance. They built the CNN model in MATLAB; it performs well in classifying osteosarcoma, reaching an accuracy of 95.37%. Mahore et al. [
13] employed various ML algorithms, including Decision Tree (DT), SVM, K-Nearest Neighbors (KNN), and Adaptive Boosting (AdaBoost), to conduct a comparative analysis of osteosarcoma classification. The findings revealed that AdaBoost outperformed the other algorithms, achieving an accuracy of 91.70%. Several studies have demonstrated the reliable prediction of osteosarcoma using DL systems. The proposed work has three goals: to develop an expert system for diagnosing osteosarcoma that helps doctors treat patients more quickly and effectively; to offer the system through telemedicine, since sophisticated diagnostic equipment is not readily available in most rural areas; and to use the system as part of a smart hospital management system in diagnostic centers.
This study presents evidence of the efficacy of DL-based tools in accurately detecting osteosarcoma tumors. The study utilizes a publicly available dataset and employs a sophisticated classification system incorporating a proposed CNN architecture and a CNN-based voting classifier. This approach, known as heterogeneous ensemble learning (ENL), aims to ensure appropriate patient treatment. The fundamental principle behind ENL resides in amalgamating the predictions derived from multiple models, potentially yielding superior outcomes compared to utilizing any singular model in isolation [
14]. The proposed voting approach extends the majority voting strategy [
15] and is designed to address its significant limitations. The dataset of pathology archives from the Children’s Medical Center [
16] has been processed for use with DL algorithms to classify viable tumor, non-tumor, and necrotic tumor tissue and to facilitate subsequent research. Our dataset is unevenly distributed, so splitting it may leave the training set imbalanced. With such a dataset, models may become biased toward the class with the larger population [
17].
Bias in ML is widely regarded as a problematic factor [
18]. Our solution introduces a way of reducing bias so that the resulting DL model is not skewed toward any class. Six modified transfer learning models, namely MobileNetV1 [
19], MobileNetV2 [
20], ResNetV250 [
21], InceptionV2 [
22], NasNetMobile [
23], and EfficientNetV2-B0 [
24], are evaluated. In each scenario, the adapted transfer learning model improves on its predecessor architecture. The top layers have been adjusted to optimize the output. The six transfer learning models are trained and assessed in two phases, frozen and fine-tuned. A CNN model with a custom-built architecture is also designed and developed by adapting and enhancing the concept outlined in [
25] to classify osteosarcoma. A comparative analysis has been made. The proposed CNN architecture trained with a balanced training set achieves an accuracy of 95.63%. It outperforms the ordinary and fine-tuned pre-trained models developed from balanced and imbalanced training sets. Moreover, the proposed ENL-based max voting classifier built from the proposed CNN, fine-tuned NasNetMobile, and EfficientNetV2B0 base learners, designated ENL-CNE, achieves 96.51% accuracy and outperforms all other models. For the cancerous tumor class, the proposed ENL model achieves the highest recall, 100%. The contributions of this study are as follows.
A structured dataset for ML-based osteosarcoma classification was constructed. An augmentation strategy was incorporated into the training data.
In transfer learning, six pre-trained CNN models were applied to the dataset for classifying osteosarcoma. An optimal pre-trained model using fine-tuning by unfreezing the entire model was developed.
A CNN architecture was developed that, with a balanced dataset, makes classification more effective and gives a faster classification rate.
An adapted heterogeneous ENL-based voting classifier and brute-force strategy were constructed to evaluate all combinations of base learners systematically.
The performance of all the learning models used in this study and comparisons among them were analyzed.
The remainder of this study is structured as follows. Section 2 covers the literature review. Section 3 presents the research methodology. Section 4 presents the implementation details. Section 5 shows the result analysis. Finally, Section 6 summarizes the results and discusses potential future studies.
2. Literature Review
The following discussion draws on various available literature concerning the diagnosis of osteosarcoma. Ahmed et al. [
26] proposed a compact CNN architecture to classify small and imbalanced osteosarcoma histology image datasets. The study employed an over-sampling technique to mitigate class imbalance and overfitting. Experimental results demonstrate that the proposed CNN models achieve high accuracies, with the non-regularized model attaining 78% testing accuracy for the imbalanced dataset and 81% testing accuracy for the balanced dataset. The regularized model achieves 75% testing accuracy for the imbalanced dataset and 86% testing accuracy for the balanced dataset. Ahmed et al. [
26], Gawade et al. [
27], Vezakis et al. [
28], and the present study all use the same dataset. It consists of histology images of hematoxylin and eosin-stained osteosarcoma collected by a group of clinical professionals from the University of Texas at Dallas.
Gawade et al. [
27] proposed an automatic DL approach for detecting osteosarcoma bone cancer using CNN-based models. The researchers examined four algorithms to construct their conceptual framework: VGG16, VGG19, DenseNet201, and ResNet101. In their study, the authors [
27] assessed the effectiveness of their approach using several performance metrics: accuracy, F1 score, precision, recall, AUC, and Vscore. The findings indicated that the ResNet101 model outperformed the other models, attaining the highest accuracy of 90.36%, an F1 score of 89.35%, precision of 89.51%, recall of 89.59%, AUC of 0.946, and Vscore of 2.720.
Furthermore, Vezakis et al. [
28] intended to demonstrate the efficiency of 12 pre-trained DL models for osteosarcoma classification, emphasizing the importance of selecting models with smaller parameter sizes. They split the dataset into 70% for training and 30% for testing. The pre-trained models were fine-tuned using the PyTorch framework, and the top-performing networks with the appropriate image input size were selected. On average, MobileNetV2 was identified as the best-performing model based on the macro-average F1 score.
However, Shen et al. [
29] conducted a study to classify osteosarcoma and benign tumor patients using ML algorithms, specifically Random Forest (RF) and Support Vector Machine (SVM). They used image features and metabolomic data and evaluated model performance based on accuracy, sensitivity, specificity, p-value, and AUC. The study involved X-ray image segmentation, feature extraction, feature selection, and ML-based classification. They used 5-fold cross-validation to make the accuracy estimates more robust. The RF model achieved an accuracy of 85%, sensitivity of 92%, specificity of 78%,
p-value of 0.044, and AUC of 0.94. In contrast, the SVM model achieved an accuracy of 81%, sensitivity of 81%, specificity of 80%,
p-value of 0.080, and AUC of 0.86. The performance analysis demonstrates that the RF model outperformed the SVM model. On the other hand, Nabid et al. [
30] introduced a sequential Recurrent Convolutional Neural Network (RCNN) comprising a CNN and bidirectional Gated Recurrent Units (GRU) for osteosarcoma classification. The model’s performance was enhanced using stain normalization techniques. Using the osteosarcoma histopathological image dataset, the model was compared with AlexNet, ResNet50, VGG16, LeNet, and SVM baselines. In [
30], a method was proposed consisting of four Histology Region Convolution (HRC) blocks, followed by bidirectional Gated Recurrent Units (GRU) and dense networks. It achieved an accuracy of 89%, precision of 88%, recall of 89%, and F1 score of 89%. The area under the ROC curve for non-tumor, viable tumor, and necrotic cells were 0.9, 0.86, and 0.88, respectively.
Anisuzzaman et al. [
31] investigated the effectiveness of DL-based pre-trained models for osteosarcoma detection using a public histological image dataset. The objective was to distinguish necrotic images from non-necrotic and healthy tissues. The novelty of the proposed approach in [
31] lies in applying pre-trained models to different dataset categories, using the entire tile image as input. Without patches, transfer learning techniques such as InceptionV3 and VGG19 were utilized on Whole Slide Images (WSI). Both binary and multi-class classification were performed using VGG19 and InceptionV3. The models were trained for 1500 epochs with an Adam optimizer and a learning rate of 0.01. The VGG19 model demonstrated the best level of accuracy across all scenarios. In addition, Mishra et al. [
32] proposed using CNN to enhance the efficiency and accuracy of classifying osteosarcoma tumors into tumor classes (viable tumor, necrosis) versus non-tumor. Their study introduces a novel application of CNN designed for osteosarcoma image classification. The dataset employed in their study comprised one thousand images categorized as Viable, Necrosis, and Non-Tumor.
On the contrary, certain investigations are undertaken utilizing genome data. To examine the expression profile of repetitive elements (RE) in osteosarcoma, Ho et al. [
33] analyzed the entire RNA of 36 fresh-frozen paired samples from osteosarcoma patients, 18 tumor and 18 non-tumor. They discovered that 82 REs were differentially expressed between osteosarcoma and normal bone; of these significantly altered REs, 35 were up-regulated and 47 were down-regulated. Reimann et al. [
34] sought to identify novel biomarkers for osteosarcoma. Next-generation sequencing was employed to analyze the complete exome of both tumorous and non-tumorous bone tissue samples obtained from a patient diagnosed with osteosarcoma. Multiple software programs were used for data processing, in which exome data were integrated with RNA-seq data. Their investigation identified about three thousand somatic single nucleotide variations (SNVs) and small insertions or deletions, as well as over two thousand copy number variants (CNVs) distributed across various chromosomes. A comprehensive analysis of the tumor exome revealed extensive genomic rearrangements that meet the criteria for chromothripsis. The genes in which mutations were identified can be regarded as potential candidate biomarkers for osteosarcoma. They also observed that somatic modifications are specifically related to the development of bone tumors, while germline mutations are related to the occurrence of cancer in a broader sense.
The CNN architecture introduced by Mishra et al. [32] consists of three sets of convolutional layers paired with corresponding max-pooling layers, which perform feature extraction, followed by two fully connected layers; data augmentation is used to boost performance. The researchers explored different baseline architectures with varying numbers of hidden layers to optimize performance. The extended version of the network (with more hidden layers and the filter size reduced from 5 × 5 to 3 × 3) outperformed the simple baseline architecture. The per-class accuracies of the baseline implementation and the proposed architecture were, respectively: Viable, 83% and 92%; Necrosis, 73% and 90%; and Non-Tumor, 91% and 95%. Moreover, the average accuracies of AlexNet, LeNet, VGGNet, the baseline architecture, and their proposed architecture were 73%, 67%, 67%, 84%, and 92.40%, respectively. Asito et al. [
35] proposed a computer-aided diagnosis system using CNNs for osteosarcoma detection on bone radiography. They employed a window-based approach, where CNNs were applied to classify each window and identify cancer-affected regions in the image. The dataset used in the study originated from a study conducted at the University of Sao Paulo. The windows were categorized as normal or tumor (osteosarcoma) using CNNs, comparing their custom CNN model and a pre-trained VGG16. Beyond these techniques, Decision Tree, Random Forest, MLP, and MLP with feature selection classifiers were employed. The pre-trained CNN achieved the highest accuracy of 77% and the highest sensitivity of 84%, and the MLP with feature selection algorithm also achieved the highest sensitivity of 84%. The MLP attained the highest specificity of 76%. These findings highlight the effectiveness of CNNs in osteosarcoma detection on bone radiography and demonstrate the superior performance of the pre-trained VGG16 compared to the other models.
3. Research Methodology
This section describes the research methods used in the study.
Figure 1 summarizes the proposed methodology, which proceeds in the following phases. After the dataset is obtained from the Cancer Imaging Archive, it is organized into three folders named after the classes. Next, the dataset is divided into two portions: 80% for training and 20% for testing. The raw dataset is highly imbalanced, so the training set has been balanced using a data augmentation library named “Albumentations”: the minority classes have been over-sampled to the size of the largest class. Subsequently, the training and test sets have undergone image preprocessing procedures, including image normalization.
A CNN model with a customized architecture tailored to this study and six pre-trained deep transfer learning CNN models, namely MobileNetV1, MobileNetV2, ResNetV250, InceptionV2, NasNetMobile, and EfficientNetV2B0, have been applied to the training set. Every model has been evaluated comprehensively, and the collective findings have then been examined together. Additionally, an adapted voting classifier, shown in
Figure 2, has been devised as a specialized implementation of heterogeneous ENL that also mitigates certain drawbacks of standard voting.
The ENL approach is heterogeneous, as the constituent base models encompass diverse types [
36]. Adopting the max voting technique is intended to improve the effectiveness of DL classifiers [
37]. Algorithm 1 presents the proposed modified majority voting ensemble approach. In this approach, the vote counter tallies, for every testing instance, the votes cast by the base learners for each category and stores them in CF. The final prediction FPre_i is the category with the highest frequency value. Drawbacks such as two or more categories receiving the same number of votes are addressed by incorporating class probability, as outlined in lines 16–21 of Algorithm 1. As depicted in
Figure 2, the smart voting coordinator overcomes these limitations by deriving the final output from the highest frequency value accumulated by the vote counter. The smart voting coordinator then uses a brute-force mechanism to assess every possible combination of the base learners, where each combination contains at least two base learners. Such coordination is intended to yield a robust and accurate final prediction. Reducing mortality after an osteosarcoma diagnosis is the main objective in clinical practice, and an early-stage tumor must be kept from metastasizing. In addition to lowering the likelihood of a false positive, early automatic detection can also support the physician in deciding whether metastasis has occurred. With CNN-based computer-aided technology, the physician’s workload can be significantly reduced and patient outcomes can be improved. Section 3.1 describes the DL models used in this study.
Algorithm 1 Adapted Majority Voting Ensemble Algorithm
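The voting scheme described above can be illustrated with a minimal Python sketch. It is not a transcription of Algorithm 1: the function names (`vote`, `ensemble_predict`, `best_combination`), the toy probabilities, and in particular the tie-break rule (preferring the tied class with the largest summed class probability) are assumptions made for illustration. The text states only that ties are resolved by incorporating class probability and that every combination of at least two base learners is assessed.

```python
from collections import Counter
from itertools import combinations


def vote(prob_rows):
    """Return the winning class index for one test instance.

    prob_rows holds one class-probability vector per base learner.
    """
    n_classes = len(prob_rows[0])
    # Each base learner votes for its most probable class.
    votes = Counter(max(range(n_classes), key=row.__getitem__) for row in prob_rows)
    top = max(votes.values())
    tied = [c for c, n in votes.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    # Assumed tie-break: prefer the tied class with the largest summed probability.
    return max(tied, key=lambda c: sum(row[c] for row in prob_rows))


def ensemble_predict(probs_by_model, members):
    """Apply the vote to every test instance for the chosen base learners."""
    n_instances = len(probs_by_model[members[0]])
    return [vote([probs_by_model[m][i] for m in members]) for i in range(n_instances)]


def best_combination(probs_by_model, y_true):
    """Brute-force search over every combination of at least two base learners."""
    names = sorted(probs_by_model)
    best_acc, best_members = -1.0, ()
    for size in range(2, len(names) + 1):
        for members in combinations(names, size):
            pred = ensemble_predict(probs_by_model, members)
            acc = sum(p == t for p, t in zip(pred, y_true)) / len(y_true)
            if acc > best_acc:
                best_acc, best_members = acc, members
    return best_acc, best_members


# Toy probabilities for three hypothetical base learners on three test instances.
probs = {
    "cnn": [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]],
    "nasnet": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7]],
    "effnet": [[0.25, 0.5, 0.25], [0.1, 0.8, 0.1], [0.4, 0.35, 0.25]],
}
accuracy, members = best_combination(probs, y_true=[0, 1, 2])
```

With two base learners a tie occurs whenever they disagree, which is why some probability-based fallback is needed once pairs are allowed in the search.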
3.1. Deep Learning Algorithms
This section discusses the deep learning (DL) methods employed in our investigation: the fundamental elements of the deep CNN model and six pre-trained deep transfer learning models, namely MobileNetV1, MobileNetV2, ResNetV250, InceptionV2, NasNetMobile, and EfficientNetV2B0.
CNN: Among DL networks, the CNN is widely used, particularly for computer vision tasks. Waibel et al. [
38] and Lecun et al. [
39] developed early CNN architectures: the former for phoneme recognition, sharing weights across temporal receptive fields and trained with back-propagation, and the latter a practical CNN architecture for document recognition. The CNN is a DL network and a supervised ML algorithm. Its key advantage over its predecessors is that it extracts the essential features from the dataset automatically [
40] as it consists of some primary layers [
41]. The subsequent section delineates its several layers.
Convolutional Layers: The convolutional layer is one of the most significant layers of a CNN. In this layer, kernels (filters) of weights are convolved with the input to extract features, which is the main benefit of CNNs.
Pooling Layers: The main objective of the pooling layer is to systematically decrease the spatial dimensions of its input, thereby reducing the computational load on the network. Pooling is a down-sampling operation that passes only the most salient information to subsequent layers.
Dropout Layers: The dropout layer randomly drops nodes at each training iteration to reduce overfitting, which introduces variability into the training process [
42].
Fully Connected Layers: The fully connected layer is one of the most elemental components in CNN. The final several layers of the network are known as fully connected layers. The fully connected layer is responsible for receiving the output from the preceding pooling or convolutional layer. Prior to its application, the output is flattened. In a fully connected layer, the input first undergoes multiplication by a weight matrix and then an addition of a bias vector [
43].
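As a small worked illustration of the pooling and fully connected layers described above, the following pure-Python sketch applies 2 × 2 max pooling to a toy feature map, flattens the result, and computes a dense layer as a weight-matrix multiplication plus a bias vector. The feature map, weights, and function names are illustrative assumptions, not values or code from this study.

```python
def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2 over a feature map given as a list of rows."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


def flatten(fmap):
    """Turn a 2-D feature map into a 1-D vector, row by row."""
    return [value for row in fmap for value in row]


def dense(x, weights, bias):
    """Fully connected layer: y = W x + b (no activation applied)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


# Hypothetical 4 x 4 feature map produced by an earlier convolutional layer.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]

pooled = max_pool2x2(fmap)   # [[4, 2], [2, 8]]
x = flatten(pooled)          # [4, 2, 2, 8]
logits = dense(x, weights=[[1, 0, 0, 0], [0, 1, 1, 1]], bias=[0, 1])  # [4, 13]
```

Pooling reduces the 16 input values to 4, so the dense layer needs a quarter of the weights it would otherwise require.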
MobileNetV1: MobileNet is a CNN architecture, pre-trained on the ImageNet dataset, that is commonly used for transfer learning. It was designed to maximize accuracy under the restricted resources typically available to on-device or embedded applications. The foundation of MobileNet is the depthwise separable convolution, which consists of two layers: a depthwise convolution and a pointwise convolution. Depthwise convolution applies a single filter to each input channel, filtering the input without creating new features [
44]. A pointwise (1 × 1) convolution is therefore added to combine the depthwise outputs linearly and create new features. Each convolution is followed by batch normalization (BN) and a rectified linear unit (ReLU) [
45].
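The saving offered by depthwise separable convolutions can be made concrete by counting weights. The sketch below compares a standard convolution with its depthwise separable counterpart; the layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are hypothetical, and biases and batch-normalization parameters are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise convolution followed by a pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


std = standard_conv_params(3, 32, 64)        # 18432
dsc = depthwise_separable_params(3, 32, 64)  # 288 + 2048 = 2336
```

For these sizes the separable form uses roughly one eighth of the weights, which is the kind of reduction that makes the architecture attractive for embedded use.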
MobileNetV2: MobileNetV2 is a lightweight CNN that is frequently used as a pre-trained model for transfer learning. Its pre-training set was the ImageNet dataset, which comprises 1.4 million images in 1000 classes. Its fundamental architecture is based on that of its predecessor, MobileNetV1, and it is 53 layers deep. MobileNetV2 was published by Google Inc. [
46]. MobileNetV2 combines depthwise separable convolutions (DSC) with linear bottlenecks, which address the loss of information in the non-linear layers of convolutional blocks. It is a very efficient feature extractor for image classification [
47].
ResNetV250: In 2016, He et al. [
48] developed the deep residual network (ResNet) model. ResNet is a CNN architecture, pre-trained on the ImageNet dataset, that is used for transfer learning. DL training faces several challenges, including long training times and limits on network depth. The study [
49] was conducted to address the complexity of DL training. ResNet is efficient: it requires little computation time and trains well. Deeper networks suffer from the vanishing gradient and degradation problems, which residual connections are designed to mitigate. A ResNet with 50 layers in total is called ResNet50. Another reason to use the residual network architecture is its capacity to accept images of sizes different from those used for training. The weights used in ResNet come from pre-training on the ImageNet dataset.
InceptionV2: The Inception-v2 network is frequently used as a pre-trained model for transfer learning. It is the second generation of the Inception convolutional network. Batch normalization is prominently used in Inception-v2, and its advantages allowed dropout and local response normalization to be eliminated. It takes 224 × 224 images as input. The architecture of Inception-v2 uses 3 × 3 filters where Inception-v1 used 5 × 5 filters, making the second version faster [
50].
NasNetMobile: NasNetMobile is a CNN trained on more than one million images from the ImageNet collection. The Neural Architecture Search Network (NASNet) was conceived and developed by the Google Brain team. It is an adaptable CNN architecture in which reinforcement learning is used to optimize the building blocks (cells). It is built from two primary cell types, normal cells and reduction cells [
51]. NASNet designs come in two major varieties: NASNetLarge and NasNetMobile. A cell comprises just a few operations and is repeated several times, according to the network’s required capacity.
EfficientNetV2B0: EfficientNet is a CNN architecture, pre-trained on the ImageNet dataset, that is used for transfer learning. The network, termed EfficientNet, was initially proposed by Tan and Le [
52]. The EfficientNet series comprises eight sub-networks, B0–B7, ordered by degree of scaling, with each higher model number corresponding to a version with more parameters and greater accuracy. Google AI created the model, and it is accessible through GitHub repositories. Using EfficientNet for transfer learning saves processing time and power. The EfficientNet models are scaled CNN models that have already been trained and can be applied through transfer learning to image classification problems [
53]. Tan and Le [
54] further enhanced the network in 2021, naming the result EfficientNet-V2, and divided it into S, M, and L sub-networks. Experimental validation showed that the new network is more efficient, consumes fewer resources, and achieves higher test accuracy than the previous EfficientNetV1 [
54].
3.2. Data Collection
The dataset for this investigation was obtained from the Cancer Imaging Archive website [
11]. The dataset named “Osteosarcoma Data from UT Southwestern/UT Dallas for Viable and Necrotic-Tumor Assessment (Osteosarcoma-Tumor-Assessment)” contains 1144 images of size 1024 × 1024 at 10× resolution. It consists of histology images of osteosarcoma stained with hematoxylin and eosin (H&E). The histology images included in the dataset were obtained from Children’s Medical Center, Dallas. The dataset encompasses a total of 50 patients who were treated at the medical center throughout the period spanning from 1995 to 2015. The images in
Figure 3 are categorized based on the predominant type of cancer present. The categories are Non-Tumor, which indicates the absence of tumor cells; Viable Tumor, which indicates the presence of actively growing tumor cells; and Necrotic Tumor, which indicates the presence of tumor cells that have been destroyed. The non-tumor category comprises 536 histological images, the viable-tumor category 345 images, and the necrotic-tumor category just 263 images.
3.3. Data Preprocessing and Normalization
Image preprocessing prepares images for use in model training and inference. Preprocessing steps include resizing, orienting, and color modifications. Preprocessing aims to improve the image data by suppressing unintended distortions or enhancing visual properties that are crucial for further processing. The images in the dataset used in this investigation are 1024 × 1024 pixels and were resized to 224 × 224 pixels to make the computations faster. Normalization is a technique used in image processing to modify the range of pixel intensity values. Its typical function is to transform an input image into a range of pixel values that is more conventional or familiar. The images are composed of pixel values in the range 0 to 255. Working with such large values is impractical and time-consuming and demands more capable computing devices. Dividing the pixel values by 255 rescales them to the range 0 to 1, which reduces this burden.
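The normalization step amounts to a single division per pixel value. The following minimal sketch applies it to a tiny hypothetical RGB tile represented as nested lists; the function name and the tile are illustrative, and in practice the same operation would be applied to the resized 224 × 224 image arrays.

```python
def normalize(image):
    """Scale 8-bit pixel values from the range [0, 255] to [0.0, 1.0]."""
    return [[[channel / 255 for channel in pixel] for pixel in row] for row in image]


# Hypothetical 1 x 2 RGB tile (rows -> pixels -> channels).
tile = [[[0, 128, 255], [255, 255, 255]]]
scaled = normalize(tile)
```

After scaling, black maps to 0.0 and white to 1.0, so every input to the network lies in the same small range.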
3.4. Dataset Splitting
The dataset must be divided into training and testing portions. To build an accurate model, most of the data should be kept in the training set rather than the testing set [
55]. In this study, the dataset was divided in an 80%/20% ratio for training and testing, respectively. A total of 10% of the training set examples were used as a validation set.
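A minimal sketch of such a split is shown below, assuming a simple seeded random shuffle over image indices. The function name, the seed, and the choice to carve the validation portion out of the training indices up front are assumptions; the study may instead hold out the validation examples at fit time.

```python
import random


def split_indices(n_items, test_frac=0.2, val_frac=0.1, seed=42):
    """Shuffle item indices, then split them into train, validation, and test lists."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = round(n_items * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    # The validation set is taken as a fraction of the remaining training examples.
    n_val = round(len(train) * val_frac)
    val, train = train[:n_val], train[n_val:]
    return train, val, test


# 1144 images in total, as in the dataset described in this paper.
train, val, test = split_indices(1144)
n_test = len(test)                     # 229 images held out for testing
n_train_total = len(train) + len(val)  # 915 images before the validation split
```

The 915 images left after removing the test portion match the sum of the per-class training counts reported in Section 3.5 (422 + 208 + 285).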
3.5. Dataset Balancing and Augmentation
The dataset used in this study is highly imbalanced, which significantly affects the results. Class imbalance is a considerable challenge because it can introduce bias and hinder the effective application of traditional learning algorithms in real-world domains. The training set was therefore balanced after the dataset had been split into training, testing, and validation sets [56], using the data augmentation library Albumentations [57]. Albumentations is a fast, flexible open-source library for image augmentation that offers a wide range of image transform operations and serves as an intuitive wrapper around other augmentation tools [58]. After splitting, the training set contains 422 non-tumor images, 208 necrotic-tumor images, and 285 viable-tumor images. The minority classes in the training set were over-sampled to match the largest class, so the necrotic-tumor and viable-tumor classes were each increased to 422 images. The over-sampling was performed using horizontal flipping.
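The over-sampling step can be illustrated with a short NumPy sketch. The study used Albumentations for the flip; plain array slicing is swapped in here so that the example has no external dependency. The helper `oversample_with_hflip`, the choice of which images to flip, and the tiny synthetic images are all assumptions made for illustration.

```python
import numpy as np

def oversample_with_hflip(images, target_count, seed=0):
    """Pad a class to target_count by appending horizontally flipped copies."""
    rng = np.random.default_rng(seed)
    n_extra = target_count - len(images)
    if n_extra <= 0:
        return images
    # Sample with replacement only when more copies are needed than originals exist.
    picks = rng.choice(len(images), size=n_extra, replace=n_extra > len(images))
    flipped = images[picks][:, :, ::-1, :]  # reverse the width axis of (N, H, W, C)
    return np.concatenate([images, flipped], axis=0)

# Class sizes of the training set as reported in the text.
train_counts = {"non_tumor": 422, "necrotic_tumor": 208, "viable_tumor": 285}
target = max(train_counts.values())

rng = np.random.default_rng(1)
balanced = {}
for name, n in train_counts.items():
    # Small 8 x 8 synthetic images keep the example light.
    imgs = rng.integers(0, 256, size=(n, 8, 8, 3), dtype=np.uint8)
    balanced[name] = oversample_with_hflip(imgs, target)

print({name: len(v) for name, v in balanced.items()})
```

Note that the necrotic-tumor class needs 214 extra images but has only 208 originals, so under this sketch some flipped images necessarily appear more than once.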
Figure 4 shows the data distribution of each class before and after balancing. Augmentation techniques such as vertical flipping, rotation, and brightness adjustment were applied to the training set. Data augmentation expands the volume of data available to train a model. DL models often require large amounts of training data to make reliable predictions, and such data are not always readily available; the available data are therefore expanded to support a more general model. The ImageDataGenerator class from the Keras API was used so that the model is exposed to new variations of the images in each epoch. A notable benefit of ImageDataGenerator is its low memory use, since augmented batches are generated on the fly rather than stored.
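The idea of on-the-fly augmentation can be sketched with a plain Python generator. This is a dependency-free stand-in for the Keras ImageDataGenerator, not its implementation: the flip probability, the restriction to rotations by multiples of 90 degrees, and the brightness range of 0.8 to 1.2 are assumptions, as is the helper name `augmented_batches`. Inputs are assumed to be square images already scaled to [0, 1].

```python
import numpy as np

def augmented_batches(images, labels, batch_size=32, seed=0):
    """Yield shuffled batches with random vertical flip, rotation, and brightness change."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = images[idx].astype(np.float32)
        out = np.empty_like(batch)
        for i, img in enumerate(batch):
            if rng.random() < 0.5:
                img = img[::-1, :, :]                            # vertical flip
            img = np.rot90(img, k=int(rng.integers(0, 4)))       # rotate by 0, 90, 180, or 270 degrees
            img = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0) # brightness scaling
            out[i] = img
        yield out, labels[idx]

# Synthetic data: 10 normalized 16 x 16 RGB images with three class labels.
rng = np.random.default_rng(2)
images = rng.random((10, 16, 16, 3)).astype(np.float32)
labels = np.arange(10) % 3
batches = list(augmented_batches(images, labels, batch_size=4))
print([b.shape for b, _ in batches])
```

Because each batch is transformed only when it is requested, just one batch of augmented images is held in memory at a time; passing a different seed per epoch yields different variations of the same images.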
5. Results and Discussion
This section examines the results obtained from each model. The pre-trained CNN models were trained on the imbalanced training set in two distinct modes: first, with the weights of every layer kept the same as in the original model (Frozen), and second, with the weights of every layer trained (Fine-Tuning).
Table 3 reports the performance of each model on the imbalanced training set. Among the pre-trained models, MobileNetV1 achieved the best accuracy, precision, recall, and f1-score (94.32%, 94%, 94%, and 94%, respectively), with a Kappa of 90.93%. EfficientNetV2B0 follows with 93.89% accuracy and 93% precision, recall, and f1-score; its ROC score and log-loss are 0.990 and 0.303, respectively.
To improve performance and make the evaluation fair and unbiased, the training set was balanced, and all models were trained again on the balanced set.
Table 4 presents the results of all models on the balanced training set. In most cases, performance improved. For example, MobileNetV2, NasNetMobile, and EfficientNetV2B0 trained in fine-tuning mode achieve the highest accuracy among the pre-trained models.
The line graph in Figure 6 compares the Kappa scores of the frozen and fine-tuned transfer learning models trained on the balanced and imbalanced training sets. The fine-tuned models MobileNetV2, NasNetMobile, and EfficientNetV2B0, trained on the balanced dataset, show higher Kappa scores than their earlier versions, indicating better performance. Frozen ResNet50V2 trained on the balanced training set also reaches its top score compared with its previous states. NasNetMobile has the second-highest accuracy and Kappa score of all pre-trained models. NasNetMobile also has the lowest log-loss, indicating more reliable probability estimates. Fine-tuned MobileNetV1 trained on the imbalanced dataset had the best accuracy and Kappa score of any pre-trained model.
The confusion matrices in Figure 7 and Figure 8 illustrate the performance gap between MobileNetV1 and NasNetMobile. MobileNetV1 performs better in classifying the “Non-Tumor” and “Viable-Tumor” categories. Conversely, NasNetMobile is more accurate on the “Necrosis Tumor” class, correctly identifying 52 examples from the test set. These findings highlight the strengths of each model on specific tumor classes and can guide targeted application in medical image classification tasks.
The proposed CNN model was also trained on the same imbalanced training set presented in Table 3. Among all models trained on the imbalanced set, the proposed CNN architecture obtained the best results, with accuracy, precision, recall, f1-score, ROC score, Kappa, and log-loss of 95.20%, 95%, 95%, 95%, 0.995, 92.33%, and 0.129, respectively. Table 4 shows the results of the proposed CNN architecture trained on the balanced training set. The proposed CNN model compares favorably with the existing models trained on either the balanced or the imbalanced training set. It attains the highest accuracy, 95.63%, and its precision, recall, f1-score, ROC score, Kappa, and log-loss are 95%, 96%, 95%, 0.993, 93.09%, and 0.158, respectively.
The training and validation accuracy curves illustrate a gradual increase in the validation accuracy line, closely following the trend of the training accuracy line. Similarly, the training and validation loss curves depict a steady reduction in the validation loss, mirroring the pattern of the training loss.
Figure 9 and Figure 10 show the training and validation accuracy and loss curves for the CNN model developed in this study, trained on the balanced dataset. These plots give insight into the model’s performance and convergence during training.
The test set contains 114 non-tumor images, of which the model classifies 108 correctly; 5 are misclassified as necrotic tumor and 1 as viable. Of the 55 necrotic-tumor images, 54 are classified correctly and 1 is classified as non-tumor. Of the 60 viable images, 57 are classified correctly and 3 are classified as necrotic tumor. The confusion matrix of the proposed CNN model on the balanced dataset is shown in Figure 11.
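As a consistency check, the overall accuracy and Cohen's Kappa can be recomputed from the counts given above. The matrix below is assembled from those counts (rows are true classes, columns are predicted classes); the helper name `accuracy_and_kappa` is illustrative.

```python
import numpy as np

# Rows: true class; columns: predicted class.
# Order: non-tumor, necrotic-tumor, viable.
cm = np.array([
    [108, 5, 1],
    [1, 54, 0],
    [0, 3, 57],
])

def accuracy_and_kappa(cm):
    """Observed agreement and Cohen's Kappa from a square confusion matrix."""
    total = cm.sum()
    observed = np.trace(cm) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

acc, kappa = accuracy_and_kappa(cm)
print(f"accuracy = {acc:.2%}, kappa = {kappa:.2%}")
# prints: accuracy = 95.63%, kappa = 93.09%
```

Both values agree with the accuracy of 95.63% and Kappa of 93.09% reported for the proposed CNN on the balanced training set.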
Table 5 shows the class-wise performance of the proposed CNN model trained on the balanced training set. The model achieves its highest accuracy, AUC, and f1-score on the “Viable” class, its highest precision on the “Non-tumor” class, and its highest recall on the “Necrotic-Tumor” class.
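Class-wise precision and recall follow directly from the confusion-matrix counts reported for the proposed CNN on the test set. The sketch below recomputes them from those counts; the values may differ slightly from Table 5 because of rounding, and the variable names are illustrative.

```python
import numpy as np

classes = ["non-tumor", "necrotic-tumor", "viable"]
# Rows: true class; columns: predicted class (counts as reported in the text).
cm = np.array([
    [108, 5, 1],
    [1, 54, 0],
    [0, 3, 57],
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = tp / cm.sum(axis=1)     # correct predictions / all true members of that class

for name, p, r in zip(classes, precision, recall):
    print(f"{name:15s} precision = {p:.4f}  recall = {r:.4f}")

macro_precision = precision.mean()
macro_recall = recall.mean()
print(f"macro precision = {macro_precision:.4f}, macro recall = {macro_recall:.4f}")
```

Under these counts, precision is highest for the non-tumor class and recall is highest for the necrotic-tumor class, and the macro averages round to the 95% precision and 96% recall reported for the model.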
Figure 12 compares the proposed CNN model’s class-wise accuracy, precision, recall, f1-score, and AUC score on the balanced dataset. The graphical representation makes the model’s strengths and weaknesses on individual classes easier to assess.
In the ROC AUC analysis of the proposed CNN model shown in Figure 13, the micro-average and macro-average AUC both reach 99%.
The ensemble results were obtained by evaluating all combinations of the fine-tuned models trained on the balanced training set, including the proposed CNN model, where each combination contains at least two base learners. The performance metrics of three high-performing ensemble models are recorded in Table 6.
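The search over ensembles described above amounts to enumerating every subset of at least two base learners and scoring each one. The sketch below illustrates this under clearly stated assumptions: the model pool listed is only illustrative (the full pool is not given in this section), the fusion rule is assumed to be soft voting (averaging class probabilities), which this section does not confirm, and the probabilities are random placeholders rather than real model outputs.

```python
from itertools import combinations
import numpy as np

def candidate_ensembles(model_names, min_size=2):
    """Yield every combination of base learners with at least min_size members."""
    for size in range(min_size, len(model_names) + 1):
        yield from combinations(model_names, size)

def soft_vote(prob_list):
    """Average class probabilities across models and pick the most likely class (assumed rule)."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs.argmax(axis=1)

# Illustrative pool of base learners; not the authors' full list.
names = ["CNN", "MobileNetV2", "NasNetMobile", "EfficientNetV2B0"]
ensembles = list(candidate_ensembles(names))

# Placeholder probabilities for 5 samples and 3 classes per model.
rng = np.random.default_rng(3)
probs = {name: rng.dirichlet(np.ones(3), size=5) for name in names}

members = ("CNN", "NasNetMobile", "EfficientNetV2B0")
preds = soft_vote([probs[m] for m in members])
print(len(ensembles), preds)
```

With four candidate models there are 11 combinations of size two or more; each would be scored on held-out data and the best-performing ones retained.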
Table 6 shows that the ensemble model ENL-CNE achieves the highest precision, recall, F1 score, Kappa score, and accuracy of the three ensembles. ENL-CNE also outperforms all other models in terms of accuracy, Kappa score, precision, and F1 score.
Table 7 compares the class-wise performance of the proposed CNN model and the proposed ensemble learning-based ENL-CNE model.
The proposed CNN model has higher precision for the non-tumor class and higher recall for the necrotic-tumor class. However, the ENL-CNE model outperforms the proposed CNN model in all other cases.
Figure 14 shows the confusion matrix for the proposed ENL-CNE model. The test set contains 114 non-tumor images, of which the model correctly classifies 110. Of the 55 necrotic-tumor images, the model correctly classifies 51. In the viable class, the model correctly classifies all 60 images. The proposed ENL-CNE model thus achieves an outstanding classification rate for viable tumors.
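As a quick arithmetic check, the correctly classified counts stated above for the three classes give the overall test accuracy of the ensemble:

```latex
\text{accuracy} = \frac{110 + 51 + 60}{114 + 55 + 60} = \frac{221}{229} \approx 0.9651
```

This agrees with the accuracy of 96.51% reported for ENL-CNE.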
The results of our proposed CNN model are compared in Table 8 with those of other studies that have used the same osteosarcoma dataset. Among the existing literature, the CNN proposed by Ahmed et al. [26] shows the lowest accuracy, and VGG19 yields the highest accuracy in the analysis conducted by Anisuzzaman et al. [31]. The CNN model introduced by Mishra et al. [32] attained the second-highest accuracy, 92.40%, among the existing methodologies. Mahore et al. [13] and Vezakis et al. [28] achieved accuracies of about 91% by employing AdaBoost and MobileNetV2, respectively. Our proposed CNN exceeded these figures with an accuracy of 95.63%. Furthermore, the proposed ENL-CNE classifier, an ENL-based model composed of the proposed CNN, fine-tuned NasNetMobile, and EfficientNetV2B0 base learners, achieved an accuracy of 96.51%. The comparative analysis supports the robustness of our methodologies. Although differences in validation methods limit direct comparison, these results indicate a clear gain in accuracy.
The Gradient-weighted Class Activation Mapping (Grad-CAM) technique was employed to enhance the interpretability of our model. The CNN modules extract information from images at multiple layers, thereby capturing a range of levels of abstraction. Grad-CAM uses the gradients of the target class score with respect to the feature maps of a specific convolutional layer. These gradients indicate how changes in the feature maps affect the final classification score [65]. In Figure 15, Grad-CAM provides a visualization that helps to interpret our proposed CNN’s decision-making process, making it more transparent and explainable.
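The core Grad-CAM computation can be sketched in a few lines of NumPy. The sketch assumes that the feature maps of the chosen convolutional layer and the gradients of the target class score with respect to them have already been obtained from a deep learning framework; random arrays stand in for both here, and the function name `grad_cam` and the 7 × 7 × 64 layer shape are illustrative.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap for one image.

    feature_maps, gradients: (H, W, K) arrays for a single convolutional layer,
    where gradients holds d(class score) / d(feature_maps).
    """
    # Channel weights: global average pooling of the gradients over space.
    weights = gradients.mean(axis=(0, 1))                        # shape (K,)
    # Weighted sum of the feature maps over channels.
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))   # shape (H, W)
    cam = np.maximum(cam, 0.0)  # ReLU keeps only positive evidence for the class
    if cam.max() > 0:
        cam = cam / cam.max()   # scale to [0, 1] for display
    return cam

# Placeholder activations and gradients (not real model outputs).
rng = np.random.default_rng(4)
feature_maps = rng.random((7, 7, 64))
gradients = rng.normal(size=(7, 7, 64))
heatmap = grad_cam(feature_maps, gradients)
print(heatmap.shape)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the image to show which regions contributed most to the predicted class.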