1. Introduction
According to the Central Brain Tumor report of the National Brain Tumor Society [
1], about one million Americans are living with a brain tumor today, and more than 90,000 people will be diagnosed with a primary brain tumor in 2023 [
2]. A brain tumor is defined as abnormal cells in the brain, which are frequently categorized as benign/low grade (non-cancerous) or malignant (cancerous) [
3]. Benign tumors (grades I and II) are non-progressive and, therefore, considered less aggressive. They originate in the brain, grow gradually, and cannot spread to other body parts. Malignant brain tumors, on the other hand, come in a variety of types and grades. The predominant primary brain tumors in adults are gliomas (comprising astrocytomas, oligodendrogliomas, and ependymomas) and meningiomas. Meningioma is a single category, with 15 subtypes, in the fifth edition of the WHO classification of Central Nervous System (CNS) tumors. Most subtypes are benign and considered CNS WHO grade I [
4]. Glioma tumors have diverse intensities and spread across the brain’s glial cells. They can be classified into four grades (grades I through IV), from benign to the most malignant [
5]. Pituitary adenomas are typically benign, slow-growing tumors and are the most common type of pituitary gland tumor [
6]. Meningioma and pituitary tumors grow around the skull region and the pituitary gland, respectively. Thus, early brain tumor detection becomes a critical yet challenging task for assisting in the appropriate selection of treatment options to preserve patients’ lives.
Generally, a series of physical and neurological examinations are utilized to diagnose brain tumors. The most reliable method for diagnosing brain tumors is through biopsy. This involves removing and examining a tissue sample under a microscope using various histological techniques. However, biopsies are invasive and carry a risk of bleeding, tissue injury, and functional loss [
7]. Diagnosing brain tumors poses a challenge for healthcare providers due to their complex nature. The prompt identification and treatment of brain tumors are critical to the survival of these patients [
8]. Various imaging techniques (e.g., magnetic resonance imaging (MRI) and computerized tomography (CT)) are effectively utilized as noninvasive assistive tools for diagnosis, and a biopsy and pathological examination are performed to confirm a diagnosis. MRI is the preferred choice among these imaging techniques as both a non-ionizing and non-invasive method [
5]. At the core of modern neuroimaging is the adoption of non-invasive brain tumor diagnosis and classification using MRI, as it permits clinicians to examine the fundamental structural and functional characteristics of brain tumors [
7,
9]. As a result, medical imaging plays a crucial role in determining the type and stage of the tumor and in developing a treatment plan. However, the manual review of these images is time-consuming, inconsistent, and prone to errors given the volume of patients. Brain tumor classification is one of the most difficult tasks owing to tumor heterogeneity, isointense and hypointense features, and associated perilesional edema. T1-weighted (T1-w) contrast-enhanced MRIs are typically used to classify primary tumors, such as meningioma, and secondary cancers. Traditional approaches for classifying brain tumors relied on region-based tumor segmentation followed by standard feature extraction and classification techniques. With the adoption of AI-based tools, e.g., deep learning (DL) methodologies, the paradigm has shifted toward end-to-end classification.
Various traditional machine learning (ML) or advanced DL models have been adopted to build effective tools for brain tumor classification. In [
10], Cheng et al. enhanced tumor regions of interest with an adaptive spatial technique to divide these regions into subregions. Model-based features were extracted, including histogram-derived features, texture features derived using the gray-level co-occurrence matrix (GLCM), and bag-of-words features. Their experiments showed accuracies of 87.54%, 89.72%, and 91.28% for these feature types, respectively, using the spatial partition method. However, a relatively small dataset was used, and only engineered features were utilized for classification. Rathi and Palani [11] developed a brain tumor classification tool based on principal component analysis (PCA) and linear discriminant analysis. They incorporated several hand-crafted features, such as intensity, texture, and shape, to label brain tissues as white matter, gray matter, CSF, abnormal, and normal. The classification was based on a support vector machine (SVM). However, their approach demonstrated a lower average accuracy of 0.83 (sensitivity: 0.88; specificity: 0.80), and their method employed hand-crafted features only and used traditional ML. Kumar et al. [
12] proposed a framework for classifying brain tumors in MRIs based on gray wolf optimization and a multiclass SVM. The gray wolf optimization technique performed better than the firefly algorithm and particle swarm optimization, reaching a classification accuracy of 95.23%. Unlike the proposed system, they employed only GLCM for feature extraction and used a smaller dataset (3064 images). Ismael and Abdel-Qader [
13] integrated the 2D discrete wavelet transform and Gabor filtering to build a strong transform-domain statistical feature set for classifying brain tumors. They trained a back-propagation multilayer neural network on the derived statistical MRI features. Although their approach combined the two transform methods, it achieved a relatively modest accuracy of 91.9% on a small dataset. Similar to [
10], Abir et al. proposed a probabilistic neural network for brain tumor classification [
14]. In preprocessing, they applied image filtering, sharpening, resizing, and contrast enhancement, as well as extracting texture features using GLCM. Their suggested method achieved an accuracy of 83.33%.
In addition to research work utilizing the extraction of hand-crafted features (e.g., [
10,
11,
12,
13]), models involving deep architectures that classify images in a self-learning fashion have also been developed. A modified capsule network called CapsNet was suggested by Afshar et al. [
15]. This network exploited the spatial relationship between the brain lesion and the surrounding non-tumor tissue. The CapsNet architecture achieved an overall accuracy of 88.33%. Despite the modified neural architecture, the method yielded a low accuracy and required a relatively long training schedule (50 epochs). Abiwinanda et al. [
16] used a CNN-based DL model to classify brain tumor images, evaluating five classification models. A ReLU layer and a max-pooling layer were included in the final architecture. The study reported a validation accuracy of 84.19%. However, the architecture was shallow and was validated on a small dataset with three classes [
17]. Deepak et al. [
18] used the same data source as [
16] and applied a deep TL technique (GoogLeNet) to classify the images. Features were extracted from the images and then employed in building the test and classification models. The authors reached an accuracy of 98% using five-fold cross-validation. Although overfitting was studied, system performance degrades as the training sample size is reduced. Also, a high learning rate and a single classifier (SVM or KNN) were used. Similarly, Swati et al. [
19] exploited pre-trained CNNs, employing a block-wise fine-tuning mechanism on VGG19, and achieved an overall accuracy of 94.82%. A hybrid feature extraction method by Gumaei et al. [
20] was developed for classifying brain tumors. The feature vector was extracted using PCA and normalized GIST descriptors. Finally, a regularized extreme learning machine was proposed to classify brain tumors with 94.23% accuracy. Although their method showed improved classification accuracy, they used a hold-out evaluation method tested on 3064 images. A generic CNN-based algorithm consisting of six convolutional and max-pooling layers and only one fully connected layer was introduced by Anaraki et al. [
21]. Following that, the best models generated by the genetic algorithm (GA) were averaged. Brain tumor classification accuracy reached 94.2% on the tested dataset. However, their method required a longer training schedule (100 epochs), and a hold-out evaluation was conducted using only 615 images (500 + 115) for testing.
Recently, Sharif et al. [
22] proposed a decision support system utilizing a pre-trained DenseNet201. Entropy–kurtosis-based high feature values (EKbHFV) and modified GA (MGA) meta-heuristics were used for feature extraction. The BRATS2018 and BRATS2019 datasets were evaluated using a multiclass cubic SVM classifier. On BRATS2018 and BRATS2019, accuracies of 99.7% and 99.8% were obtained for glioblastoma (GBM/HGG) and 98.8% and 99.3% for lower-grade glioma (LGG), respectively. The main limitation of their study is the removal of certain important features, which affects the system’s accuracy. A comparative study of five CNN architectures was proposed by Asif et al. [
23]. They modified the final layers of Xception, DenseNet201, DenseNet121, ResNet152V2, and InceptionResNetV2 with a deep dense block and a softmax layer as the output layer. Using the Figshare dataset (3064 T1-w MRIs), they achieved an accuracy of 99.67% on the three-class dataset and 95.87% on the four-class (inclusive of healthy subjects) dataset. The results show that the proposed model based on the Xception architecture is the most suitable deep model for multi-class brain tumor classification. Despite the high accuracy, they only fine-tuned the pre-trained models’ parameters using small-scale training data. Agrawal et al. [
24] developed a DL model called MultiFeNet that uses a multi-scale architecture for feature extraction. Instead of employing various kernel sizes, multi-scaling was implemented using a dilation rate. The introduced model was tested using five-fold cross-validation, achieving 96.4% for sensitivity, F1-score, precision, and accuracy. However, the multi-scale feature extraction increased the system’s complexity and computational expense and, hence, the difficulty of network training and optimization. Also, cross-dataset generalization was not assessed, as the authors trained and tested their method on the same dataset. A transfer learning (TL)-based DL approach for the multi-class classification of brain tumor type via fine-tuning of pre-trained EfficientNets was proposed by Zulfiqar et al. [
25]. Five variants of modified EfficientNets were trained under different experimental settings. GradCAM-based visualization maps of modified EfficientNetB2 were applied to MR brain tumor sequences. The reported accuracy, precision, recall/sensitivity, and F1-score were 98.86%, 98.65%, 98.77%, and 98.71%, respectively. However, reduced accuracy of 91.53% was observed when performing cross-validation experiments on different datasets. A predictive CNN model using a hybrid generative adversarial network (GAN) was proposed by Sahoo et al. [
26]. Both GAN-augmented samples and the original augmented dataset were fed into an in-house CNN classification model. Among various GAN architectures, the progressive-growing GAN (PGGAN) demonstrated accuracy, precision, recall, F1-score, and NPV values of 98.8%, 98.45%, 97.2%, 98.11%, and 98.09%, respectively. Although the system showed promising results on various diseases, brain tumor recognition was evaluated on a small dataset, Figshare, using a single train/validation/test split. El-Wahab et al. [
27] leveraged 1 × 1 convolution layers and TL to realize a fast brain tumor classification process. The method achieved an average accuracy of 98.63% using five TL iterations and 98.86% using retrained k-fold cross-validation (internal TL between the folds, k = 5). Although the method mitigated the overfitting caused by unnecessary parameters, it was trained only on the Figshare dataset, incurred increased computational cost, and struggled with noisy data. In [
28], a reinforcement learning-based architecture was introduced by Chaki et al. Similar to the present study, they used multiple datasets. The first dataset included 7023 images (1645 meningioma, 1621 glioma, 1757 pituitary, and 2000 normal), and the second dataset contained 253 images (155 brain tumors and 98 normal). Their suggested approach achieved an accuracy of 97.5%. The limitation of their study is that the inference time was too long (9 h).
29]. Their images were collected from three datasets, and their method employed a CNN with an MLP classification head, incorporating the CSA optimization algorithm to enhance accuracy. Their approach achieved an accuracy of 98.56%. Although the authors combined CNN-derived and higher-order hand-crafted features, only binary classification and hold-out validation scenarios were employed.
Various studies with promising results have been proposed in the literature, and this paper extends the existing work on brain tumor classification. Most of the above-mentioned studies used a single dataset, whereas our approach integrates multiple datasets for system training and evaluates the method on an additional local dataset. The limited availability of annotated data, combined with small dataset sizes, may lead to overfitting and hinder the generalization of a proposed framework to diverse clinical scenarios. While pre-trained CNNs offer transferable features learned from large-scale datasets, their applicability to medical imaging tasks, particularly brain tumor characterization, may be constrained by domain-specific variations and complexities not adequately captured in general-purpose pre-trained models.
The proposed framework utilizes an ensemble architecture as an automated tool for tumor detection and diagnosis. The developed architecture integrates multiple learnable modules with the ability to capture localized and long-range dependencies. Integrating those modules improves the system’s ability to recognize complex patterns and facilitates the understanding of the context of features across the entire brain image, thus improving the system’s performance. The main contributions of the present work include (i) a robust ensemble architecture that integrates a weighted feature fusion of information-rich features for classification, as compared to direct feature concatenation; (ii) the retention of prominent disease-related features through three learnable modules, compared to single-feature-based methods; (iii) the capture of long-range dependencies, non-local relationships, and complex patterns by integrating a ViT in addition to a localized CNN-based method; and (iv) an improved system classification accuracy, documented on public and local datasets via a cross-validation evaluation scenario.
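For illustration, the following minimal sketch shows one plausible realization of the accuracy-weighted feature fusion in contribution (i); the branch names, dimensions, and weights are illustrative placeholders, not the exact mechanism or values of our framework:

```python
import numpy as np

def weighted_feature_fusion(branch_features, branch_accuracies):
    """Scale each branch's feature vector by a weight derived from its
    standalone accuracy, then concatenate into one fused vector."""
    weights = np.asarray(branch_accuracies, dtype=np.float64)
    weights = weights / weights.sum()  # normalize weights to sum to 1
    fused = [w * f for w, f in zip(weights, branch_features)]
    return np.concatenate(fused)

# Illustrative use: CNN, hand-crafted (radiomics), and ViT feature branches.
cnn_feats = np.random.rand(256)   # placeholder CNN-derived features
hcf_feats = np.random.rand(64)    # placeholder texture-derived features
vit_feats = np.random.rand(768)   # placeholder ViT-derived features
fused = weighted_feature_fusion([cnn_feats, hcf_feats, vit_feats],
                                [0.90, 0.93, 0.97])  # hypothetical accuracies
print(fused.shape)                # (1088,)
```

In contrast to direct concatenation, this weighting lets branches that are individually more predictive contribute proportionally more to the fused representation.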
The remainder of this paper is structured as follows. Section 1 has introduced modern CAD systems for brain tumor detection and diagnosis, reviewed the recent and related literature, and outlined the contributions of this work. Full descriptions of the data, the methodology, and the details of the learnable modules and feature extraction strategies are given in Section 2. The employed evaluation criteria, conducted experiments, and obtained results are given in Section 4. A discussion of our results and the associated conclusions, as well as suggestions for future work, are outlined in Section 5 and Section 6, respectively.
4. Experimental Results
In this work, the proposed ensemble architecture is evaluated using the public and locally-acquired datasets described in
Section 2.1. For the public dataset, the ground-truth labels were provided to the researchers and used to assess the accuracy/sensitivity of the proposed system. For the local dataset, all patients were first characterized based on MRI findings on T2, DWI, and contrast-enhanced T1 images. All patients with malignant tumors underwent biopsy.
The back-end for the proposed architecture was TensorFlow 2.13.0 with Python 3.8. The CNN feature extractor consisted of five convolutional layers, each followed by a ReLU activation function and a max-pooling layer to improve computational efficiency. A flatten layer reshaped the feature maps into a one-dimensional vector, which was passed into an MLP trained over 50 epochs with a batch size of 125 for classification. The MLP employed a sparse categorical cross-entropy loss function and the Adam optimizer with a learning rate starting at 0.001, which was reduced during training for better results. The ViT model was configured with a 16 × 16 image patch size, 20 layers, a hidden dimension of 3072, and 12 attention heads. Each preprocessed image was expanded with an additional batch dimension (via NumPy’s expand_dims) before being passed through the ViT model for feature extraction, and the extracted features were then flattened into a one-dimensional vector. The MLP classifier model consisted of a flatten layer producing 86,000 features and three dense layers, each with ReLU activation, 0.5 dropout, and batch normalization, followed by a softmax activation in the final layer. The MLP model was compiled and evaluated with the Adam optimizer, a batch size of 125, 30 epochs, sparse categorical cross-entropy loss, and the accuracy metric.
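A minimal Keras sketch of the CNN branch described above (five convolutional blocks with ReLU and max pooling, a flatten layer, and a dense head trained with sparse categorical cross-entropy and Adam) is given below; the filter counts and dense-layer widths are illustrative assumptions, not the exact values of our implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_mlp(input_shape=(224, 224, 3), num_classes=4):
    """Five conv blocks (Conv2D + ReLU + MaxPooling), then flatten + MLP head."""
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    for filters in (32, 64, 128, 128, 256):   # illustrative widths
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
    model.add(layers.Flatten())
    for units in (512, 256, 128):             # three dense layers, as described
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_cnn_mlp()
# Illustrative training call matching the settings in the text:
# cb = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss")  # LR decay
# model.fit(x_train, y_train, epochs=50, batch_size=125,
#           validation_split=0.2, callbacks=[cb])
```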
All of our experiments and analysis were conducted on a Dell workstation with a 12th Gen Intel® Core™ i9-12700 CPU (20 logical processors), 64.0 GB of memory, 1.5 TB of disk capacity, and an NVIDIA GeForce RTX 3060 GPU. The end-to-end execution time for testing the proposed ensemble was 60 ± 0.29 s, including feature extraction, fusion, and MLP classification. Model assessment used five-fold cross-validation for performance evaluation. As a less biased estimator, cross-validation indicates how well the deep architecture will transfer to an independent dataset and helps to partially reduce overfitting and selection bias. In each fold, 80% of the entire dataset was used for training and 20% for testing. In our experiments, quantitative assessment relied on three metrics: accuracy (Ac), sensitivity (Se), and specificity (Sp) [
41]. These metrics are defined as Ac = (Tp + Tn)/(Tp + Tn + Fp + Fn), Se = Tp/(Tp + Fn), and Sp = Tn/(Tn + Fp), where Tp (Tn) represents the number of true positive (negative) samples, and Fp (Fn) represents the number of false positive (negative) samples.
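For clarity, these one-vs-rest metrics can be computed directly from a multi-class confusion matrix; the sketch below is purely illustrative and not part of our evaluation pipeline:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, num_classes):
    """One-vs-rest accuracy, sensitivity, specificity for each class."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    metrics = {}
    for c in range(num_classes):
        tp = cm[c, c]                  # correctly predicted as class c
        fn = cm[c, :].sum() - tp       # class c missed
        fp = cm[:, c].sum() - tp       # other classes predicted as c
        tn = cm.sum() - tp - fn - fp   # everything else
        metrics[c] = {
            "Ac": (tp + tn) / (tp + tn + fp + fn),
            "Se": tp / (tp + fn),
            "Sp": tn / (tn + fp),
        }
    return metrics

# Illustrative use with dummy labels for a four-class problem:
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 3, 0, 2, 2, 3])
print(per_class_metrics(y_true, y_pred, num_classes=4))
```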
Table 3 summarizes the overall performance in terms of the above metrics. To examine the importance of the different MRI-derived features, we employed ablation techniques to understand the contributions of the individual ensemble components. Three ablation scenarios were explored: individual modules, paired modules, and all three modules combined. For paired modules, we assessed the contribution of (1) higher-order texture and deeper features while excluding the ViT-derived features; (2) deeper and ViT-derived features while leaving out the texture features; and (3) ViT-derived and texture features while excluding the CNN model. The performance of the ensemble model was compared against these scenarios, and the results on the public dataset are tabulated in
Table 3.
Furthermore, the confusion matrix (CM) and receiver operating characteristic (ROC) curve, powerful tools for evaluating and comparing classification models, were used to perform another quantitative evaluation of the proposed methods. The first row of
Figure 4 shows each individual model’s CM, which is extremely useful for determining which classes were misclassified the most. Moreover, other metrics (e.g., precision, F1-score, and recall) may be calculated from a given matrix. The second row of
Figure 4 shows the ROC curves, which provide a robustness analysis via graphical illustration for a given class. Technically, intermediate curve points are constructed by varying the decision threshold (i.e., the control parameter) used to classify instances into the positive or negative class; this sweep generates various trade-offs between the true positive and false positive rates, from which the curve is constructed. Additionally, the broken black diagonal line represents random guessing; ROCs falling below this line indicate poor model performance in identifying such classes. ROCs are thus valuable assessment tools showing how well an ML classifier distinguishes between different classes. The ROCs and CMs of the proposed ensemble method tested on the first (public) dataset are shown in
Figure 5.
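For reference, ROC curves of this kind can be generated by sweeping the decision threshold over predicted class probabilities; the following illustrative sketch (using scikit-learn and Matplotlib, with randomly generated scores standing in for model outputs) plots a one-vs-rest ROC curve and its AUC for a single class:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_one_vs_rest_roc(y_true, y_score, positive_class, class_name):
    """y_true: integer labels; y_score: (n_samples, n_classes) probabilities."""
    y_bin = (np.asarray(y_true) == positive_class).astype(int)
    fpr, tpr, _ = roc_curve(y_bin, y_score[:, positive_class])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{class_name} (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], "k--", label="random guess")  # diagonal baseline
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()

# Illustrative use with random probabilities for a four-class problem:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)
y_score = rng.dirichlet(np.ones(4), size=200)  # each row sums to 1
plot_one_vs_rest_roc(y_true, y_score, positive_class=2, class_name="pituitary")
```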
To demonstrate the effectiveness of the proposed architecture, the trained model on the public dataset was used to classify brain images from the local dataset. The accuracy is reported in the last row of
Table 3, and both the confusion matrix and ROC curves are shown in
Figure 6. The results obtained on the second (local) dataset demonstrate how well the proposed model generalizes to a different dataset. Overall, the quantitative results and the ROC and CM scores document the high performance of the proposed model in correctly classifying brain MRIs.
Finally, to emphasize the benefits of the proposed architecture, we compared the accuracy of our proposed model with competitive recent literature models that partially utilized similar or combined datasets. Some comparative models use only the Figshare dataset, whereas our study uses three public datasets, including Figshare; we also compared against approaches that aggregated different datasets so that our methodology’s efficacy and generalizability are fairly assessed. To strengthen the argument for our framework without being unfair to other authors, we deliberately avoided reimplementing the competitive state-of-the-art (SOTA) approaches and instead compared the reported accuracies of recent literature models that utilized similar datasets. This ensures a fair comparison while still providing insight into how our framework performs relative to similar datasets and methodologies. Since accuracy is the major factor for comparing classification results, we considered the mean accuracy for the quantitative analysis.
Table 4 shows how the proposed approach compares to other methods tested on various brain tumor MRI data regarding accuracy.
5. Discussion
Precise diagnosis and classification of brain tumors is of utmost importance for early intervention. Clinically, various physical and neurological examinations are utilized for diagnosis, and biopsy remains the most reliable diagnostic method, i.e., the gold standard. However, it is invasive and carries potential risks of bleeding, tissue injury, and functional loss. As a result, AI-based research utilizes medical imaging for its crucial role in determining the type and stage of the tumor and in developing a treatment plan. Over the last decade, numerous studies have exploited different imaging modalities (e.g., MRI and CT) for the prompt identification of brain tumors; the authors of [48] presented a comprehensive review. MRI is the preferred choice as it is a non-ionizing and non-invasive method. The main objective of this work is to develop a robust architecture for brain tumor identification using MRI. The focus is on identifying prominent localized and non-localized information-rich features associated with disease, building upon public and local datasets for system training and validation. Thus, this work contributes a system capable of classifying brain tumors based on rich MRI-derived cues.
We have introduced a hybrid architecture for the multi-class classification of brain tumors, integrating three learning modules to extract texture (radiomics) and deep hidden features, all combined by a feature-weighting scheme to form a rich feature vector for tumor classification. Quantitatively, and as shown in
Table 3, the integrative effects of the employed models demonstrate the superiority of our ensemble approach (accuracy ≥ 99%) on both datasets. The subsequent ablation study, dissecting the ensemble model under various scenarios, reveals that, among the ablated variants, the combination of ViT-derived and texture features outperforms the others with an overall accuracy of ∼95%.
Further,
Table 3 summarizes the evaluation metrics of the ensemble and ablated models. CNN-derived features (CnF) alone showed reduced performance compared with the hand-crafted features (HcF). This can partly be explained by the shallow nature of the neural architecture and the integrative nature of the three sets of radiomics features (i.e., GLCM, GLRM, and LBP). The ViT-based (ViF) classification alone showed the best results, which is explained by its ability to capture long-range dependencies, making it suitable for recognizing complex patterns. The “CnF+HcF”, “CnF+ViF”, and “ViF+HcF” ablated models’ results show the performance of the ensemble model without the ViT, HcF, and CNN branches, respectively.
Table 3 shows that the fusion experiments including ViT-derived features improve upon the performance of the individual branches. This affirms the utility of the ViT-derived features and of the employed fusion strategy, in which weights are assigned to each feature branch based on its prediction accuracy.
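To make the hand-crafted branch concrete, the following minimal sketch extracts GLCM statistics and an LBP histogram from a grayscale slice. It is an assumed pipeline rather than our exact code: the function names assume scikit-image ≥ 0.19, all parameter settings are illustrative, and the GLRM features mentioned above are omitted because scikit-image does not provide them:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def texture_features(gray_img):
    """GLCM statistics plus an LBP histogram for one 8-bit grayscale image."""
    # GLCM at one distance and four angles (illustrative settings).
    glcm = graycomatrix(gray_img, distances=[1],
                        angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],
                        levels=256, symmetric=True, normed=True)
    glcm_stats = np.hstack([
        graycoprops(glcm, prop).ravel()
        for prop in ("contrast", "homogeneity", "energy", "correlation")
    ])
    # Uniform LBP with 8 neighbors at radius 1, summarized as a histogram.
    lbp = local_binary_pattern(gray_img, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.hstack([glcm_stats, lbp_hist])

# Illustrative use on a random 8-bit image standing in for an MRI slice:
img = (np.random.rand(128, 128) * 255).astype(np.uint8)
print(texture_features(img).shape)
```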
In addition to quantitative metrics, robustness analysis is also highlighted using ROC curves, as shown by the interconnected lines in
Figure 5 and
Figure 6. Generally, the assessment of model effectiveness using those curves is based on the area under the curve (AUC), which quantitatively assesses the model’s ability to identify a specific class, with “1” and “0” indicating the best and worst performance, respectively. Notably, in the ROC curves for the first (public) dataset (Figure 5), the normal and pituitary classes reside closest to the top-left corner (AUC ∼100%), while the meningioma and glioma classes lie slightly farther from this position. A similar observation can be drawn from
Figure 6 for the second dataset. The ROC curves and AUC scores show that our architecture classifies each class well. Regarding individual model performance in predicting each class, the ROC curves in
Figure 4d–f also reveal that the ViT model performed best across all classes, as its curves lie closest to the top-left corner, while the CNN ROC curves are the farthest from that corner among the three individual models. This is consistent with the results in
Table 3. The confusion matrices in
Figure 4a–c further evaluate the performance of the three models, showing the proportion of instances each model predicted correctly.
The comparative accuracies in
Table 4, contrasting the proposed method against recent literature tested on various brain tumor datasets, highlight its advantages. Some of the compared methods (e.g., refs. [
11,
12,
13,
18,
19,
20,
26]) were tested partially on the public dataset in
Table 1, while others (e.g., refs. [
21,
29]) adopted a method similar to our approach of integrating various datasets to reduce model overfitting and avoid unbalanced class scenarios. No previous studies have utilized the same three datasets used in our study. However, by employing this strategy, we aim to showcase the distinct contributions and advantages of our proposed model without directly competing with or undermining the efforts of other authors. While this may not represent the optimal comparative scenario, it offers valuable context by illustrating where our framework stands relative to recent work with similar dataset usage. This allows for a nuanced evaluation of our framework’s performance and position within the broader research landscape, thus providing valuable insights for readers and researchers.
Our proposed framework not only addresses critical issues facing the scientific community but also offers a solution with far-reaching implications beyond brain tumor classification alone. While the presented architecture is designed specifically for brain tumor classification, its underlying principles and methodologies have the potential to be applied in a variety of oncology applications. The weighted fusion of multiple learnable modules within our ensemble architecture enables the capture of both localized and long-range dependencies, which increases our system’s ability to identify complex patterns indicative of brain tumors with high sensitivity. The transferability of pre-trained models and feature extraction techniques suggests that they could be applied to other cancer types, opening the door to multi-modal and multi-organ image analysis frameworks.
Despite the promising results, our method has some limitations, including the reliance on a single imaging modality (i.e., MRI) for prediction, the lack of explainability and interpretability of the machine decisions, and the use of a single ML classifier. These limitations stand as motivation for future improvement.