1. Introduction
In recent years, we have seen remarkable progress in big data applications and artificial intelligence, which has driven the creation of ground-breaking technologies that link computers with the human brain; mind-controlled wheelchairs, prosthetic arms, home automation systems, and similar innovations illustrate the role of BCI in the healthcare sector [
1,
2,
3]. The low-potential signals generated in the brain during neural activity carry a significant amount of information, which can be decoded in the field of BCI and converted into meaningful information.
Motor imagery (MI) comprises the neural activity elicited when a person imagines performing a task without any body movement or physical activity. Because it involves no physical movement, MI is defined as a cognitive process [
4,
5]. For example, in our case, the MI tasks in BCI Competition III dataset IV-a are “right-hand movement” and “right-foot movement”. Given the characteristics required for motor imagery applications, namely a low-cost, noninvasive acquisition method with high temporal resolution [
3,
6], electroencephalography (EEG) stands out as the optimal choice because it inherently possesses these characteristics. Once the signal is obtained from the subject performing the MI task, the next challenge is to interpret it accurately as a meaningful classification [
7].
Motor imagery is a central concept in the Brain–Computer Interface (BCI), a rapidly advancing field of research that aims to provide direct communication between the brain and the outside world. The human brain is a sophisticated organ with extensive networks that manage a variety of tasks, including motor control. When a person imagines performing a motor activity, such as moving a hand or foot, the brain areas linked with actual motor execution become active. These imagined actions create brain signals that BCI devices can record and identify. One of the primary benefits of motor imagery in BCI is that it does not require real body motion, making it especially beneficial for people with motor limitations. BCI systems that use motor imagery have shown potential in providing communication and control capabilities for patients suffering from spinal cord injuries, amyotrophic lateral sclerosis (ALS), or paralysis.
Electroencephalography (EEG), a non-invasive technique that captures the electrical responses of the brain using electrodes applied to the scalp, is commonly used to implement motor imagery in BCI. Certain frequency bands, such as the mu and beta rhythms, show amplitude and synchronization changes during motor imagery. These changes are then employed as features to interpret the user’s intent. Numerous facets of motor imagery-based BCI research have been investigated, including machine learning algorithms for classification, feature extraction approaches, and signal processing strategies. Advances in signal processing have improved the ability to extract relevant information from EEG signals, and machine learning techniques are essential for converting these signals into instructions that devices can understand. Research has shown that motor imagery-based BCIs are feasible for a range of uses; for example, people with locked-in syndrome or paralysis have effectively employed motor imagery to operate robotic arms and computer cursors, and even to spell out words. The quality of life of people with significant motor disabilities could be greatly enhanced by motor imagery-based BCI applications. The variability of individual brain patterns, the requirement for intensive user training, and the sensitivity of EEG signals to artifacts and noise remain challenges for motor imagery-based BCIs. Through advancements in adaptive algorithms, neurofeedback training paradigms, and signal processing, researchers are actively working to overcome these problems.
In this study, we begin with the sorting and denoising of Motor Imagery EEG data, which are crucial for enhancing the quality of the signals and reducing noise that can interfere with accurate classification. Following this preprocessing step, we extract features using self-attention deep learning methods, which allow us to identify and prioritize the most relevant aspects of the EEG signals.
After feature extraction, we employ various feature selection techniques, including Mutual Information and Recursive Feature Elimination (RFE), to further refine the dataset. These methods help in identifying the most informative features, thereby enhancing the classification performance while reducing computational complexity.
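As a rough illustration of these two selection steps, the sketch below applies Mutual Information ranking and RFE with scikit-learn on synthetic data; the shapes and parameters are illustrative placeholders, not those of our EEG pipeline.

```python
# Sketch of the two feature-selection steps described above, on
# synthetic data (shapes and parameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Mutual Information: rank features by statistical dependence on the label.
mi = SelectKBest(mutual_info_classif, k=15).fit(X, y)
X_mi = mi.transform(X)

# RFE: recursively drop the weakest features using a linear SVM.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=15).fit(X, y)
X_rfe = rfe.transform(X)

print(X_mi.shape, X_rfe.shape)  # both reduced from 30 to 15 features
```

Both selectors reduce dimensionality while keeping the columns most predictive of the class label, which is the goal of this refinement stage.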
Subsequently, we classify the processed data using both traditional machine learning methods and the latest deep learning techniques. In the machine learning domain, we utilize classifiers such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), and Linear Discriminant Analysis (LDA). For deep learning, we implement advanced models including Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), WebNet, CatBoost, and ResNet. By integrating these approaches, we aim to provide a comprehensive evaluation of the effectiveness of each model in classifying Motor Imagery EEG data, ultimately contributing to advancements in Brain–Computer Interface (BCI) systems.
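The classical machine learning branch of this comparison can be sketched as follows with scikit-learn; the synthetic dataset and hyperparameters are placeholders, not our experimental configuration.

```python
# Comparing the classical classifiers named above (SVM, k-NN, LDA)
# on one synthetic dataset; scores are illustrative only.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
}
# Fit each model on the training split and score it on the held-out split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```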
To the best of our knowledge, this is the first application of the novel Attention Sift Network (AtSiftNet) technique to enhance the classification accuracy of motor imagery tasks in the field of Brain–Computer Interfaces. Conventional methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) present several limitations in this context [
8,
9]. CNNs, while effective at capturing spatial patterns, often require large datasets and extensive training time, making them computationally expensive and time-consuming. Moreover, they can struggle with the inherent variability and noise in EEG signals. RNNs, on the other hand, are designed to process sequential data but often face challenges in retaining long-term dependencies, limiting their ability to accurately process the temporal dynamics of motor imagery data. AtSiftNet addresses these limitations by introducing an architecture that leverages self-attention mechanisms to focus on the most relevant features of the signal while reducing overall computational complexity. This approach not only improves the accuracy of motor imagery classification but also accelerates the processing time, making it more suitable for real-time applications. By bridging this gap, AtSiftNet offers a more efficient and effective solution compared to traditional CNNs and RNNs in motor imagery-based BCI systems [
10]. The fundamental premise of our approach lies in leveraging the unique capabilities of self-attention for MI data analysis. Self-attention mechanisms have demonstrated exceptional efficacy in capturing long-range dependencies within sequential data; by focusing on capturing relevant spatial and temporal relationships within the MI data, we aim to unlock richer and more discriminative features that can significantly improve classification accuracy. The feature selection phase of our methodology is a critical step in distilling the most informative elements from the extracted features. We employed a diverse set of eight feature selection methods to systematically identify and prioritize features that contribute most significantly to the classification task. These methods encompass Mutual Information Feature Selection (MIFs) [
11], Recursive Feature Elimination (RFE), Linear Regression (LR), Neighborhood Component Analysis (NCA) [
12], Relief Feature Selection (ReliefF) [
13], Independent Component Analysis (ICA) [
14,
15], Principal Component Analysis (PCA) [
16], and Correlation Feature Selection (Cor) [
17].
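The core operation behind such self-attention feature extractors, scaled dot-product attention, can be sketched in NumPy as follows; the dimensions are arbitrary and do not reflect our network's actual configuration.

```python
# Scaled dot-product self-attention, the mechanism the extraction stage
# relies on, in NumPy. Shapes are illustrative, not the paper's setup.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (timesteps, d_model). Returns attention-weighted features."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over timesteps
    return weights @ V                               # (timesteps, d_v)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))                    # 50 time steps, 16 channels
W_q, W_k, W_v = (rng.standard_normal((16, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (50, 8)
```

Each output row is a weighted mixture of all time steps, which is how attention captures the long-range dependencies discussed above.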
The inclusion of these diverse feature selection techniques ensures a comprehensive exploration of the feature space, allowing us to capture both linear and non-linear relationships within the MI data. Moving beyond feature extraction and selection, our paper introduces a robust ensemble of five classification methods. These methods, namely Support Vector Machine (SVM) [
18], Linear Discriminant Analysis (LDA) [
19], k-Nearest Neighbors (KNN) [
20], Random Forest (RF) [
21], and XGBoost Classification [
22,
23], represent a well-rounded suite of algorithms known for their efficacy in handling complex classification tasks.
The integration of multiple classifiers enables a comparative analysis of their performance, shedding light on the strengths and weaknesses of each method in the specific context of MI data classification. By combining state-of-the-art feature extraction, diverse feature selection strategies, and a suite of advanced classifiers, our research aims to contribute to the field of BCI systems. The interdisciplinary nature of our approach, merging techniques from signal processing, machine learning, and neuroscience, reflects a holistic effort to enhance the interpretability and accuracy of MI data analysis. Through this exploration, we aspire to pave the way for advancements in BCI systems, opening new possibilities for real-world applications in healthcare, rehabilitation, and human-computer interaction.
3. Results and Discussion
3.1. Analysis of Five Classifiers
In this study, we assessed the efficacy and accuracy of five classification algorithms—Linear Discriminant Analysis (LDA), XGBoost, Support Vector Machine (SVM), k-Nearest Neighbors (KNN), and Random Forest (RF)—for MI classification across the five subjects (aa, al, av, aw, ay) in our dataset. The accuracy results revealed distinct patterns among the classifiers: LDA showed consistent accuracy, while XGBoost demonstrated competitive performance. A detailed 3D bar graph of accuracy versus classification method is shown in
Figure 5.
In this novel method, the accuracies obtained from all classification methods were outstanding, but the Support Vector Machine (SVM) consistently outperformed the other classifiers, showcasing remarkably high accuracy across all subjects. Notably, SVM achieved near-perfect accuracy for subjects aa, av, and ay, indicating its robustness in discerning subtle patterns within diverse datasets. Random Forest (RF) also exhibited strong performance, particularly excelling on subject av. While not consistently surpassing SVM, RF demonstrated competitive accuracy across subjects, highlighting its adaptability to different datasets. In conclusion, both SVM and RF emerge as strong candidates for the best classifier, with SVM demonstrating exceptional accuracy across subjects and RF showcasing robust performance, especially on subject av. The choice between SVM and RF may hinge on specific considerations such as interpretability, computational efficiency, or the need for a well-generalized model. Further analysis and validation could provide additional insights into the nuanced performance of these classifiers for motor imagery classification tasks. We also calculated the sensitivity and specificity of each subject with respect to each classifier, as shown in
Figure 6 and
Figure 7.
In the realm of sensitivity and specificity analysis for motor imagery classification in this study, two classifiers, k-nearest neighbors (KNN) and Random Forest (RF), consistently demonstrate superior performance compared to their counterparts. When focusing on sensitivity, which measures the ability of a classifier to correctly identify positive instances, both KNN and RF emerge as robust choices. Their capacity to effectively capture and discern subtle patterns within the motor imagery data is reflected in higher sensitivity values across diverse subjects. Conversely, in the evaluation of specificity, which gauges a classifier’s ability to correctly identify negative instances, Support Vector Machine (SVM) and RF exhibit notable prowess. Particularly, SVM stands out for its ability to minimize false positives, showcasing strong specificity across various subjects. RF, while excelling in sensitivity, also proves to be adept at maintaining specificity, making it a versatile choice for balancing the classification performance on both positive and negative instances.
3.2. A Comparative Analysis of Feature Selections
After feature extraction using the self-attention deep learning model, the behavior of the feature selection techniques used in this method is of central importance to our study. In this paper, we conducted a comprehensive evaluation of eight feature selection methods applied to MI classification tasks using BCI Competition III dataset IV-a. Following self-attention feature extraction, eight feature selection techniques were applied to the five subjects (aa, al, av, aw, ay): Recursive Feature Elimination (RFE), Correlation feature selection (Cor), Linear Regression (LR), Neighborhood Component Analysis (NCA), Relief Feature Selection (ReliefF), Mutual Information (MIFs), Independent Component Analysis (ICA), and Principal Component Analysis (PCA). Classification performance was assessed using a Support Vector Machine (SVM), the best-performing classifier, and the results were analyzed to identify the most effective feature selection methods.
The results demonstrate that ReliefF, MIFs, ICA, and PCA consistently yield high classification accuracies across all subjects as shown in
Figure 8. These methods exhibit superior performance compared to RFE, Cor, LR, and NCA. Notably, mutual information consistently achieves near-perfect accuracy, indicating its robustness in extracting relevant features for SVM classification in MI tasks. The observed trends suggest that the information captured by ReliefF, MIFs, ICA, and PCA is particularly discriminative for our chosen methodology.
The effectiveness of these methods may stem from their ability to uncover essential patterns and reduce dimensionality, leading to improved classification performance, as demonstrated in
Table 1.
In the feature selection phase of our experiments, Recursive Feature Elimination (RFE) was applied following self-attention-based feature extraction. However, certain subjects, such as al, aw, and ay, exhibited lower classification accuracies, with values dropping below 50%. This reduction in accuracy is primarily due to the poor separation between the clusters of classes 1 and 2 in the feature space, which limits the effectiveness of RFE in identifying the most discriminative features.
RFE operates by recursively eliminating less important features based on the performance of a predictive model. However, the method is heavily reliant on the quality and separability of the feature set. In cases where class boundaries are ambiguous, or when features are highly redundant, RFE may fail to select the optimal subset of features, leading to suboptimal classification performance. Additionally, RFE can be sensitive to noise and computationally expensive, further impacting its performance in datasets with overlapping or poorly separated classes. These limitations highlight the need for more sophisticated or complementary feature selection methods when working with complex datasets.
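The separability issue discussed above can be illustrated on synthetic data: the same RFE-plus-SVM pipeline performs far better when classes are well separated. The `class_sep` values below are arbitrary demonstration choices, not properties of our dataset.

```python
# Demonstrating that RFE's usefulness depends on class separability,
# using synthetic data (class_sep values are arbitrary demo choices).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def rfe_accuracy(class_sep):
    X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                               class_sep=class_sep, random_state=0)
    # Select 5 features by recursive elimination with a linear SVM.
    X_sel = RFE(SVC(kernel="linear"), n_features_to_select=5).fit_transform(X, y)
    return cross_val_score(SVC(), X_sel, y, cv=5).mean()

well_separated = rfe_accuracy(class_sep=2.0)
overlapping = rfe_accuracy(class_sep=0.2)
print(well_separated, overlapping)  # separation drives RFE's usefulness
```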
Sensitivity, also known as the true positive rate, is the ratio of true positives to the sum of true positives and false negatives, while specificity, also known as the true negative rate, measures the model’s ability to correctly detect negative events; it is computed by dividing the number of true negatives by the sum of true negatives and false positives. In this paper, the 3D bar graphs depicting sensitivity and specificity for each feature selection method across subjects reveal insightful patterns, as shown in
Figure 9 and
Figure 10. ReliefF, Mutual Information, Independent Component Analysis (ICA), and Principal Component Analysis (PCA) have also emerged as standout performers in achieving a balance between sensitivity and specificity. These methods consistently exhibit high accuracy in identifying both positive and negative instances, making them well-suited for MI tasks. The robust performance of these feature selection techniques suggests their potential for enhancing the reliability of BCI systems, contributing to improved overall classification outcomes.
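These two definitions can be computed directly from a confusion matrix, as in the short sketch below; the example labels are made up purely for illustration.

```python
# Sensitivity and specificity from a confusion matrix
# (example labels are made up for illustration).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate: TP / (TP + FN)
specificity = tn / (tn + fp)   # true negative rate: TN / (TN + FP)
print(sensitivity, specificity)  # 0.75 and ~0.667 for this toy example
```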
3.3. Comparison between Number of Selected Features
In our previous experiments on the Motor Imagery (MI) dataset, Independent Component Analysis (ICA), Principal Component Analysis (PCA), Mutual Information Feature Selection (MIFs), and Relief Feature Selection (ReliefF) emerged as the most effective feature selection techniques for our MI task. Building upon these outcomes, we further refine our feature set by selecting, from each trial, first the single top feature and then the top fifteen features.
In
Table 2, where one top feature is selected from each trial of our MI task after feature extraction, distinct patterns emerge across the feature selection methods (MIFs, ReliefF, PCA, ICA) and machine learning models (SVM, LDA, KNN, RF, XGBoost). Notably, ReliefF consistently yields high accuracy across models, ranging from approximately 97.11% to 99.01%.
MIFs exhibit varied performance, with SVM achieving notably high accuracy values (up to 99.78%), while PCA consistently performs well, particularly with RF and XGBoost reaching accuracy values as high as 99.89%. ICA demonstrates consistently high performance across all models, with accuracy values ranging from 93.17% to 99.98%.
In
Table 3, where the top 15 features are selected from each trial, there is a discernible improvement in accuracy for most feature selection methods and models. ReliefF, in particular, shows a notable boost in accuracy, ranging from approximately 97.11% to a perfect accuracy of 100% across all subjects.
MIFs, when selecting the top 15 features, result in a significant accuracy improvement, especially for SVM, achieving perfect accuracy (100%) in the case of subject “ay” and good results in other subjects. PCA maintains its strong performance, with high accuracy values across all models, including perfect accuracy for RF (100%) for all subjects.
ICA continues to exhibit high accuracy, with values ranging from 90.38% to a perfect 100%. Overall, selecting the top 15 features generally leads to improved accuracy compared to choosing only one top feature from each trial.
The results provide a deep observation into the effectiveness of different feature selection techniques and the impact of the number of features selected on the performance and accuracy of machine learning models in the Motor Imagery (MI) task.
In this paper, Mutual Information (MIFs) consistently emerged as the most effective feature selection method. Across various machine learning models, including SVM, LDA, KNN, RF, and XGBoost, MIFs consistently demonstrated the best accuracy on our MI data. The results show consistently high accuracy values, ranging from 92.31% to a perfect 100%, depending on the specific subject and scenario, such as selecting the single top feature or the top 15 features from each trial. This robust and steady performance underscores the importance of MIFs in optimizing feature sets to enhance the predictive capabilities of the machine learning models used in our study. These findings reinforce the prominence of Mutual Information Feature Selection as a powerful technique in the domain of Motor Imagery analysis, providing a reliable and effective approach for feature selection in our experimental context.
In
Figure 11, we present a graphical comparison illustrating the impact of selecting the top 1 feature and the top 15 features from each trial using various feature selection methods—MIFs, ReliefF, PCA, and ICA. The line plot distinctly represents the accuracies achieved by each feature selection method for both scenarios. The top 1 selected feature from each trial is depicted with solid lines, while the top 15 selected features are represented by dotted lines. The color scheme enhances the visual distinction, with MIFs depicted in yellow, ReliefF in green, PCA in cyan, and ICA in magenta. Notably, the graph demonstrates a consistent trend across all feature selection methods. The accuracies obtained with the top 15 selected features consistently outperform those achieved with only the top 1 selected feature from each trial. This observation aligns with the expectation that a richer set of features, as represented by the top 15 features, provides a more comprehensive and informative input to the machine learning models.
The dotted lines, indicating the top 15 features, consistently surpass the solid lines corresponding to the top 1 feature, underscoring the significance of selecting a more extensive set of features for enhanced model performance. This visual representation reinforces the efficacy of MIFs, ReliefF, PCA, and ICA in capturing and selecting features that contribute significantly to the task at hand. The distinct color-coded lines facilitate an easy comparison of the impact of different feature selection methods on model accuracy, providing valuable insights into the performance gains achieved by selecting a larger feature set in the context of our Motor Imagery task.
3.4. Correlation Matrix Analysis for Redundancy and Independence of Selected Features
The correlation matrices shown in
Figure 12 and
Figure 13 represent the pairwise relationships among the top 15 and the top 1 selected features, respectively. Each cell in a matrix indicates the strength and direction of the linear relationship between two features, with values ranging from −1 to 1. A value of 1, indicated by the red diagonal, represents a perfect positive correlation, where a feature is correlated with itself. Positive correlations (in red shades) imply that as one feature increases, so does the other, while negative correlations (in blue shades) indicate an inverse relationship, where an increase in one feature corresponds to a decrease in the other.
In the case of the top 15 selected features many of the off-diagonal values show moderate to high positive correlations (ranging between 0.6 and 0.8), suggesting that several features are closely related to one another and may provide redundant information. For example, there are several clusters of correlations around 0.7 to 0.8, indicating that these feature pairs are strongly positively correlated. A few features, like those in the bottom-right portion of the matrix, show slightly lower correlations, suggesting more independent contributions to the dataset as shown in
Figure 12.
In the case of the top 1 selected features, most of the off-diagonal values show moderate to strong positive correlations, with many values clustering between 0.6 and 0.9 as shown in
Figure 13. This indicates that several of the selected features tend to increase or decrease together, suggesting that they may be capturing similar patterns in the dataset. For example, features in the top-left and bottom-right sections of the matrix show correlations close to 0.7 or higher, implying that these feature pairs are likely redundant.
Moreover, while some features exhibit moderate correlations, providing more diverse information, there are no significant negative correlations in the matrix. This suggests that the dataset lacks pairs of features that consistently behave in opposing directions. The absence of both strong negative correlations and near-zero correlations further highlights the potential redundancy among the selected features. When multiple features are highly correlated, as seen here, it may introduce issues like multicollinearity, especially in predictive modeling. Multicollinearity can distort the interpretation of models, particularly in linear regression, and may lead to overfitting where the model becomes too reliant on a small set of redundant features.
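A redundancy check of this kind can be sketched as follows: compute the pairwise correlation matrix of the selected features and flag strongly correlated pairs. The 0.8 threshold and the synthetic data are illustrative assumptions, not values from our experiments.

```python
# Flagging redundant feature pairs via the correlation matrix
# (threshold and synthetic data are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(42)
base = rng.standard_normal((100, 1))
# Three columns are near-duplicates of `base`; two are independent noise.
features = np.hstack([base + 0.1 * rng.standard_normal((100, 1)) for _ in range(3)]
                     + [rng.standard_normal((100, 2))])

corr = np.corrcoef(features, rowvar=False)   # (5, 5) correlation matrix
iu = np.triu_indices_from(corr, k=1)         # upper-triangle (off-diagonal) pairs
redundant = [(i, j) for i, j in zip(*iu) if abs(corr[i, j]) > 0.8]
print(redundant)  # the pairs built from `base` correlate strongly
```

Pairs flagged this way are candidates for pruning to reduce multicollinearity before modeling.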
3.5. Five-Fold Validation
In our comprehensive analysis of feature selection methods for the MI task, we employed a rigorous five-fold validation strategy to evaluate the performance of four techniques: Mutual Information Feature Selection (MIFS), Relief Feature Selection (ReliefF), Independent Component Analysis (ICA), and Principal Component Analysis (PCA). Our first aim was to identify the most effective feature selection method for enhancing the accuracy of MI task classification. Remarkably, each of these methods produced promising results, demonstrating the effectiveness of its approach. Upon conducting the five-fold validation experiments, it became evident that the top-performing feature selection method varied across trials: MIFS, ReliefF, ICA, and PCA each had instances where they outperformed the others. This variability underscores the complexity of MI data and the importance of employing diverse feature selection techniques to adapt to the different characteristics present in the data.
Surprisingly, not only did we observe strong accuracy results with our top-performing feature selection method, but a deeper exploration into the top 15 features selected by each method also revealed compelling performance. The inclusion of additional features beyond the single best feature showcased the robustness of the selected features and their collective impact on classification accuracy. This observation is particularly noteworthy as it suggests that a more comprehensive feature set, encompassing the top 15 selected features, consistently contributed to favorable outcomes in our five-fold validation experiments.
Table 4 provides a detailed overview of the accuracies obtained through five-fold validation for both the top one best-selected feature and the top 15 best-selected features for each feature selection method. The tabulated results allow for a nuanced comparison between the performances of these two feature selection strategies.
Figure 14 complements the tabular data by offering a visual representation of the accuracy trends, facilitating a quick and intuitive understanding of the comparative performances. The findings presented in
Table 4 and
Figure 14 collectively suggest that while the top one best-selected feature is crucial and often yields commendable results, the broader selection of the top 15 features consistently delivers competitive accuracies. This insight underscores the importance of exploring not only the best individual features but also the relation among a selected set of features in our MI task.
Overall, our in-depth analysis of feature selection methods for MI task classification has revealed a dynamic landscape in which different trials may benefit from distinct feature selection techniques. The robustness of the top 15 selected features highlights the potential of a more inclusive approach to feature selection, leveraging the collective strength of a broader feature set. These results contribute valuable insights to the field of EEG signal processing, helping researchers and practitioners in the field of BCI select appropriate feature selection methods based on the specific characteristics of the MI data.
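A minimal version of the five-fold protocol described in this subsection, using stratified folds and an SVM in scikit-learn, might look as follows; the synthetic dataset and model settings are placeholders, not our actual configuration.

```python
# Five-fold cross-validation sketch (synthetic data, placeholder settings).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=250, n_features=15, random_state=0)

# Stratified folds keep the class balance identical in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(SVC(), X, y, cv=cv)
print(fold_scores.mean(), fold_scores.std())  # summary of per-fold accuracies
```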
3.6. Effect of Denoised and Noised MI Data
Multi-scale Principal Component Analysis (MSPCA) has proven to be particularly effective for denoising EEG signals compared to other traditional methods. While conventional denoising techniques like wavelet thresholding or basic filtering methods focus on removing noise, they often risk distorting the underlying EEG signal or removing important low-amplitude components. MSPCA, however, leverages both the multi-scale nature of wavelets and the dimensionality reduction capabilities of PCA to efficiently separate noise from a signal without significantly altering the relevant features of the EEG data.
By decomposing the signal into different scales, MSPCA isolates noise present at various frequencies and removes it while preserving the underlying structure of the EEG signal. This makes it especially suited for handling the complex, non-stationary nature of EEG data, where both low-frequency brain activity and high-frequency noise coexist. In our study, the MSPCA-denoised signals demonstrated substantial improvement in signal quality, as evidenced by a notable increase in the signal-to-noise ratio (SNR) and a reduction in root mean square error (RMSE) compared to the raw signals.
Additionally, when applied to the classification pipeline, MSPCA-denoised signals consistently yielded higher accuracy across all classifiers, outperforming results obtained from other denoising techniques. The improvement in classification accuracy underscores the effectiveness of MSPCA in preserving the essential features of the EEG signal that contribute to model performance. Thus, MSPCA not only excels in noise reduction but also enhances the overall accuracy of EEG-based classification tasks, making it a superior method for EEG signal denoising compared to more conventional approaches.
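A deliberately simplified, NumPy-only sketch of the MSPCA idea follows: a one-level Haar decomposition per channel, PCA-based reconstruction at each scale, and the inverse transform. The actual method uses full multi-level wavelet decompositions; the signal, noise level, and number of retained components here are illustrative assumptions only.

```python
# Simplified MSPCA sketch: one-level Haar transform + per-scale PCA.
# All parameters below are illustrative assumptions, not the paper's.
import numpy as np

def pca_denoise(X, n_keep):
    """Project (samples, channels) data onto its top principal components."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (U[:, :n_keep] * s[:n_keep]) @ Vt[:n_keep] + mu

def mspca_denoise(X, n_keep=1):
    """X: (samples, channels), with an even number of samples."""
    approx = (X[0::2] + X[1::2]) / 2          # Haar approximation coefficients
    detail = (X[0::2] - X[1::2]) / 2          # Haar detail coefficients
    approx, detail = pca_denoise(approx, n_keep), pca_denoise(detail, n_keep)
    out = np.empty_like(X)
    out[0::2], out[1::2] = approx + detail, approx - detail   # inverse Haar
    return out

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
clean = np.stack([np.sin(2 * np.pi * 10 * t)] * 4, axis=1)    # 4 identical channels
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
denoised = mspca_denoise(noisy, n_keep=1)

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_denoised < err_noisy)  # denoising reduces reconstruction error
```

Even this crude two-scale version removes channel-independent noise while keeping the shared oscillation, which is the intuition behind applying PCA at each wavelet scale.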
In this paper, we performed a comprehensive analysis of Motor Imagery (MI) data, by comparing the performance between denoised and non-denoised datasets using the Multi-Scale Principal Component Analysis (MSPCA) method for denoising. Our primary goal was to assess the impact of denoising on feature selection and classification accuracy. The results that we obtained from the experiment shed light on the effectiveness of the MSPCA denoising technique in improving the overall performance of the MI classification task.
First, we selected only the single best feature from each trial of our MI data and observed notable improvements in accuracy when using denoised MI data. The average accuracies were 92.33%, 88.18%, 94.13%, 90.31%, and 99.87% for subjects aa, al, av, aw, and ay, respectively. In contrast, when we used non-denoised data, the accuracies dropped noticeably to 71.12%, 64.59%, 85.16%, 80.56%, and 85.76% for the same subjects. This marked difference underscores the effectiveness of the MSPCA denoising method in enhancing the discriminatory power of the selected features and, consequently, the accuracy of the MI classification task. Next, we extended our investigation to selecting the top 15 best features from each trial. The results revealed an even more pronounced advantage for denoised data: the accuracies reached 97.33%, 95.18%, 98.43%, 96.31%, and 100% for subjects aa, al, av, aw, and ay, respectively. In contrast, without denoising, the corresponding accuracies were notably lower at 78.21%, 80.36%, 88.76%, 82.76%, and 89.79%.
This substantial improvement reaffirms the MSPCA denoising technique’s ability to extract and retain relevant information from MI signals, allowing the classifier to make more accurate predictions. Central to our experimental setup was the utilization of Support Vector Machines (SVM) as the classifier and Mutual Information (MI) as the feature selection technique. This combination proved effective in harnessing the denoised information, and the relationship between denoising and sophisticated classification methodologies. The results not only validate the robustness of SVM but also emphasize the crucial role of feature selection in optimizing classification accuracy.
In
Figure 15, a compelling graphical comparison unfolds, illustrating the impact of denoising on Motor Imagery (MI) data accuracy. The solid red line represents the accuracy trend derived from denoised MI data, while the green dotted line mirrors the accuracy of non-denoised data. This visual representation immediately communicates the profound effect of the denoising process on the classification performance of the MI task. The solid red line exhibits a notably higher and smoother trajectory, reflecting the accuracy achieved with denoised MI data and the red line consistently maintains a superior position compared to its non-denoised counterpart. This result implies that the denoising method, likely the Multi-Scale Principal Component Analysis (MSPCA) as mentioned earlier, effectively mitigates the adverse impact of noise, leading to enhanced feature discrimination and, consequently, improved accuracy.
By contrast, the green dotted line represents the accuracy of non-denoised MI data, and the visual disparity between the two lines underscores the significance of denoising techniques in the preprocessing phase of MI data analysis.
Figure 15 vividly illustrates the performance gap between denoised and non-denoised MI data. The solid red line serves as a testament to the effectiveness of denoising in elevating the accuracy of the classification task. This comparison not only substantiates the importance of preprocessing methods in enhancing the quality of MI data but also underscores the potential of denoising techniques to contribute significantly to the reliability of Motor Imagery-based applications. The visual clarity provided by this graphical representation emphasizes the practical impact of denoising on the robustness and accuracy of MI data analysis, offering valuable insights for researchers and practitioners in the field of neurotechnology.
Our application of MSPCA as a denoising technique reflects the practical integration of established methods to enhance the extraction of meaningful features from MI data, mitigating the adverse effects of noise and enhancing the discriminative power of selected features. This finding underscores the importance of preprocessing techniques, such as denoising, in preparing raw MI data for effective utilization in machine learning tasks. Moreover, the substantial increase in accuracy when selecting the top 15 features further highlights the advantage of considering a broader set of features, made possible by our denoising approach. Our experiment showcases the transformative impact of MSPCA-based denoising on MI data, resulting in significantly improved classification accuracies. The findings underscore the potential of this approach to contribute to advancements in neurotechnology, neurorehabilitation, and Brain–Computer Interfaces, where accurate and reliable MI classification is paramount.
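For readers who want to experiment with the idea, the following is a minimal numpy-only sketch of multiscale PCA denoising: a one-level Haar wavelet transform along time, retention of the leading principal components at each scale, and reconstruction. The rank-1 synthetic "EEG" signal, the noise level, and the number of retained components are all illustrative assumptions, not the settings used in this study:

```python
import numpy as np

def haar_dwt(x):
    # One-level Haar transform along the last axis (even length assumed).
    a = (x[..., ::2] + x[..., 1::2]) / np.sqrt(2)  # approximation coefficients
    d = (x[..., ::2] - x[..., 1::2]) / np.sqrt(2)  # detail coefficients
    return a, d

def haar_idwt(a, d):
    # Inverse of haar_dwt: interleave the reconstructed even/odd samples.
    x = np.empty(a.shape[:-1] + (2 * a.shape[-1],))
    x[..., ::2] = (a + d) / np.sqrt(2)
    x[..., 1::2] = (a - d) / np.sqrt(2)
    return x

def pca_denoise(X, n_keep):
    # Project (samples x channels) data onto its n_keep leading principal components.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:n_keep].T @ Vt[:n_keep] + mu

def mspca_denoise(X, n_keep=2):
    # X: (time, channels). Denoise each wavelet scale separately, then reconstruct.
    a, d = haar_dwt(X.T)             # transform along the time axis
    a = pca_denoise(a.T, n_keep).T   # PCA across channels, per scale
    d = pca_denoise(d.T, n_keep).T
    return haar_idwt(a, d).T

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
clean = np.outer(np.sin(2 * np.pi * 10 * t), rng.normal(size=8))  # rank-1 structure
noisy = clean + 0.5 * rng.normal(size=clean.shape)
den = mspca_denoise(noisy, n_keep=2)
print(np.mean((noisy - clean) ** 2), np.mean((den - clean) ** 2))
```

Because the clean signal lies in a low-dimensional channel subspace while the noise is spread over all channels, discarding the trailing components at each scale removes most of the noise energy; the denoised mean squared error should come out well below the noisy one.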
3.7. Deep Learning Classification
In this study, we utilized five advanced deep learning models—RNN, GRU, WebNet, CatBoost, and RestNet—to evaluate the performance of AtSiftNet after processing our data through Mutual Information feature selection. This method allowed us to identify the most relevant features, enhancing the classifiers’ ability to generalize and improve classification accuracy. The results, presented in both
Table 5 and
Table 6, illustrate the accuracy of each model with both the top one and top 15 selected features.
The results demonstrate a marked improvement in overall accuracy when moving from the top one selected feature to the top 15 selected features across all models. For instance, the average accuracy for RNN increased from 73.638% to 78.67%, while GRU similarly rose from 84.43% to 88.58%.
CatBoost consistently outperformed the other models, achieving an average accuracy of 94.86% with the top one feature and increasing further to 98.27% with the top 15 features. RestNet also exhibited robust performance, with average accuracies of 93.37% for the top one feature and 98.126% for the top 15 features. Notably, RNN, while showing improvement, remained the lowest-performing model in both tables, suggesting a need for further optimization of its architecture.
The results further reveal that the deep learning methods employed in this study—RNN, GRU, WebNet, CatBoost, and RestNet—differ in their capacity to leverage the selected features effectively. For example, GRU and WebNet demonstrated substantial gains, with GRU’s accuracy improving from 84.43% to 88.58% and WebNet’s from 88.52% to 93.288%. These findings highlight the effectiveness of Mutual Information in identifying relevant features that enhance classification accuracy, particularly for the more advanced models like CatBoost and RestNet, which consistently deliver superior results in classification tasks.
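Among the recurrent models above, the GRU owes its performance to a gating mechanism that decides how much of the hidden state to overwrite at each time step. A minimal numpy sketch of a single GRU update follows; the weights and dimensions are random placeholders (biases omitted for brevity), not the trained model evaluated in this study:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    # One GRU update: the gates decide how much of the old state survives.
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand           # convex blend of old and new

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = tuple(rng.normal(scale=0.1, size=shape)
               for shape in [(n_in, n_hid), (n_hid, n_hid)] * 3)
h = np.zeros(n_hid)
for x in rng.normal(size=(20, n_in)):  # run a 20-step input sequence
    h = gru_step(x, h, params)
print(h.shape)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden activations stay bounded in (−1, 1), which is part of what makes GRUs stable on long EEG sequences.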
3.8. Algorithmic Complexity
The computational efficiency of the proposed method is assessed through various performance metrics, as depicted in
Figure 16,
Figure 17 and
Figure 18. The feature extraction time for each subject, denoted as “aa”, “al”, “av”, “aw”, and “ay”, is presented in
Figure 16. Notably, subject “aa” exhibits the longest feature extraction time, totaling 1.32 s, followed closely by subject “al” with 1.30 s. In contrast, subjects “av”, “aw”, and “ay” demonstrate significantly shorter feature extraction times of 0.63, 0.32, and 0.15 s, respectively. This variance can be attributed to the differing number of trials conducted by each subject, resulting in variations in the number of features extracted.
Moving on to the classification stage,
Figure 17 illustrates the training times for Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN), Random Forest (RF), and XGBoost. The highest training time across all classifiers is observed for subject “al” with SVM, totaling 2.12 s. Meanwhile, subjects “av”, “aw”, and “ay” exhibit relatively lower training times, showcasing the computational effectiveness of the proposed method. The training times for LDA, KNN, RF, and XGBoost also vary across subjects, with the algorithms demonstrating distinct computational demands based on the underlying data characteristics.
In
Figure 18, the testing times for each subject using the same set of classifiers are presented in a bar chart. Subject “aa” displays the longest testing time across all classifiers, with SVM yielding a testing time of 0.15 s. Conversely, subjects “aw” and “ay” reveal notably shorter testing times, emphasizing the efficiency of the proposed method for real-time clinical applications of MI data.
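Timings like these can be reproduced with simple wall-clock instrumentation. The sketch below measures training and testing time for an SVM on synthetic data; the dataset and classifier settings are placeholders, so the absolute numbers will differ from those in the figures:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for one subject's extracted MI features.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC()
t0 = time.perf_counter()
clf.fit(X_tr, y_tr)                        # training phase
train_s = time.perf_counter() - t0

t0 = time.perf_counter()
clf.predict(X_te)                          # testing phase
test_s = time.perf_counter() - t0
print(f"train: {train_s:.4f} s, test: {test_s:.4f} s")
```

`time.perf_counter()` is preferred over `time.time()` here because it is monotonic and has the highest available resolution for short intervals.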
To provide a comprehensive overview of the computational performance of the proposed algorithm,
Figure 16,
Figure 17 and
Figure 18 include bar charts for feature extraction time, training time, and testing time. These bar charts offer a clear comparison of the computational demands imposed by different subjects and classifiers. The results underscore the practicality and efficiency of the suggested method, especially when compared to other feature extraction techniques. Thus, the proposed technique demonstrates promising computational efficiency in feature extraction, training, and testing across multiple subjects. The ability to handle varying trial sizes and feature sets makes it a viable candidate for real-time clinical applications. The thorough analysis presented in
Figure 16,
Figure 17 and
Figure 18 provides valuable insights into the algorithm’s performance, aiding researchers and practitioners in understanding its strengths and potential areas for improvement.
3.9. ROC Curves and AUC Values
The performance of the classifier and feature selection technique is commonly evaluated with the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) values. The ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across different threshold values, while the AUC is a scalar summary of the overall performance of a classifier or feature selection method, with values ranging between 0 and 1.
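As a minimal worked example of these definitions (using synthetic scores, not the experiment's outputs), scikit-learn computes the ROC points and the AUC directly:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical classifier scores for eight trials (1 = one MI class, 0 = the other).
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")  # → AUC = 0.812
```

The AUC here equals the fraction of positive/negative score pairs that are correctly ordered (13 of 16 = 0.8125), which is why 1.0 corresponds to perfect discrimination and 0.5 to chance.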
The ROC curves and corresponding AUC values provide a comprehensive evaluation of various feature selection techniques applied in our experiment. Each technique, namely Linear Regression, ReliefF, Mutual Information, Correlation, Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), Neighborhood Component Analysis (NCA), and Independent Component Analysis (ICA), exhibits distinctive performance characteristics.
Linear Regression and Correlation with AUC values of 0.59 demonstrate modest discriminative power. These methods may struggle to capture the complex relationships within the data, resulting in suboptimal predictive performance. On the contrary, ReliefF achieves an AUC of 1.00, indicating perfect discriminatory ability. ReliefF excels in identifying relevant features by considering feature interactions and is particularly robust in scenarios with intricate relationships.
Mutual Information, PCA, and ICA also exhibit outstanding performance, boasting AUC values of 1.00. These techniques excel in capturing non-linear dependencies and intricate patterns within the data, making them robust choices for feature selection in scenarios where feature interactions are crucial. The superior performance of Mutual Information, PCA, and ICA underscores their ability to discern relevant features effectively, contributing to the overall predictive power of the model.
RFE, with an AUC of 0.44, reveals comparatively low discriminative ability, falling below the 0.5 chance level. This technique might struggle in scenarios with high dimensionality or intricate feature dependencies, impacting its ability to select relevant features accurately. NCA, with an AUC of 0.86, demonstrates good discriminative power but falls short of the perfect scores achieved by ReliefF, Mutual Information, PCA, and ICA.
In the visual representation provided in
Figure 19, the ROC curves further illustrate the trade-offs between sensitivity and specificity for each feature selection technique. Techniques with higher AUC values exhibit curves that approach the upper-left corner, indicating superior performance. The stark differences in AUC values highlight the varying efficacy of each method in capturing the underlying structure of the data. Thus, the choice of feature selection technique significantly influences the model’s predictive performance.
ReliefF, Mutual Information, PCA, and ICA emerge as top performers, excelling in capturing relevant features and intricate relationships within the data. These findings underscore the importance of thoughtful feature selection in enhancing the overall robustness and accuracy of predictive models in the context of our experiment.
The AUC values obtained from our experiment provide valuable insights into the performance of different classifiers across subjects and feature selection strategies in the context of Motor Imagery (MI) tasks. We conducted two sets of experiments, one utilizing the single best selected feature from each trial and another employing the top 15 selected features from each MI task trial, both using Mutual Information for feature selection. The ROC graph of the 15 best-selected features, using mutual information as the feature selection technique, is shown in
Figure 20. In the case of using the top-best single selected feature, SVM consistently demonstrated exceptional performance across all subjects, achieving perfect AUC values of 1. This suggests that SVM excels in accurately discriminating between classes when utilizing the most informative feature from each trial. This robust performance is noteworthy and underscores SVM’s capability to effectively leverage individual features for precise classification in MI tasks.
LDA also demonstrated strong performance, with AUC values ranging from 0.77 to 0.99 across subjects. While slightly below the perfect score achieved by SVM, LDA’s ability to capture discriminative information from individual features is evident. KNN and XGBoost, too, demonstrated consistent high AUC values, indicating their effectiveness in leveraging the top-best single selected feature for accurate classification. RF, although performing well, exhibits slightly lower AUC values compared to SVM, LDA, KNN, and XGBoost.
When considering the top 15 selected features from each MI task trial, the classifiers’ performance dynamics shift. SVM maintains strong discriminatory power with AUC values ranging from 0.95 to 1, reaffirming its resilience when dealing with a larger feature set. Interestingly, LDA’s performance fluctuates, demonstrating a wider range of AUC values across subjects. While still effective, it suggests that LDA may be more sensitive to the composition of the feature set. KNN and RF continue to perform admirably with perfect AUC scores for most subjects. XGBoost exhibits strong performance but with slight variability, showcasing its sensitivity to feature selection.
The contrast between the two scenarios (top-best single feature vs. top 15 features) emphasizes the impact of feature selection on classification outcomes.
Figure 21 and
Figure 22 visually represent the AUC values, providing a clear comparison between the classifiers and the two feature selection scenarios. These visualizations highlight the strengths and potential limitations of each classifier and underscore the importance of thoughtful feature selection in optimizing classification outcomes for MI tasks. Hence, our experiments show the performance of classifiers across different subjects and feature selection strategies in MI tasks. SVM consistently stands out as a robust performer, particularly when leveraging individual features, while other classifiers like LDA, KNN, RF, and XGBoost demonstrate varying degrees of effectiveness depending on the feature selection strategy employed.
3.10. Comparison with Other Methods
In this comprehensive study, the focus was on evaluating the accuracy of the proposed method across multiple subjects, denoted as aa, al, av, aw, and ay. The results reveal impressive performance metrics, with individual subject accuracies ranging from 99.68% to a perfect 100%. Specifically, subjects aw and ay achieved flawless accuracy rates of 100%, underlining the robustness of the proposed self-attention feature extraction method. Subject “aa” likewise leads all other state-of-the-art methods with a remarkable 99.96%, and subject “av” follows closely, surpassing the state of the art on the MI task with a noteworthy accuracy of 99.98%.
The overall average (Avg) accuracy for all subjects was calculated at an impressive 99.924%, illustrating the consistency and solidity of the proposed method, as demonstrated in
Table 7. A notable aspect of the study is the achievement of a 100% classification accuracy for subjects aw and ay, showcasing the efficacy of the proposed method in handling specific subjects with great precision. The attention to detail extends to the comparison of the proposed method with others, revealing that the method attains the highest average classification accuracy of 99.55%, leading all other studies in the case of BCI competition III dataset IVa.
One exceptional aspect of the proposed method is its consistency, as evidenced by the low standard deviation (Std) value of 0.52 in our case. This low variability indicates the reliability and stability of the results across all subjects of the MI dataset. Such consistency is crucial in applications where precision and repeatability are paramount.
Thus, the study provides compelling evidence for the effectiveness of the proposed self-attention feature extraction method in achieving high classification accuracies across a diverse set of subjects. The meticulous comparisons with state-of-the-art methods, along with the detailed analysis of individual subject performances, collectively contribute to positioning the proposed method as a standout approach in the field, demonstrating not only superior accuracy but also remarkable consistency in its results.
3.11. Future Directions
While our present research has yielded promising results and demonstrated the efficacy of the proposed self-attention feature extraction method in binary classification tasks across subjects aa, al, av, aw, and ay, it is essential to acknowledge the existing limitations and chart a course for future directions. As we reflect on the advantages of our approach, it becomes apparent that the scope of our investigation has been confined to specific EEG domains, namely the subjects mentioned, without exploring broader applications in domains such as dementia, schizophrenia, alcoholism, and stroke disorders.
One significant avenue for future exploration lies in extending the application of our method to diverse EEG domains. By diversifying the scope of our research, we can unlock the potential of the proposed self-attention feature extraction method to contribute to the understanding and diagnosis of neurological conditions beyond the current binary classification framework. Investigating the applicability of our method in domains characterized by distinct EEG patterns and complexities could open new avenues for diagnostic and therapeutic interventions. For instance, exploring its efficacy in identifying patterns associated with dementia or schizophrenia could pave the way for novel insights and advancements in these critical areas of neuroscience.
Another critical dimension for future research involves transcending the binary classification paradigm. While our method has excelled in distinguishing between two classes, the transition to multiclass classification scenarios holds great promise [
36,
37]. Expanding the model’s capabilities to accommodate multiple classes would enhance its versatility and applicability across a broader spectrum of EEG-based applications. This evolution could involve adapting the self-attention mechanism to effectively discern and classify EEG patterns associated with different cognitive states or neurological conditions, contributing to a more nuanced and comprehensive understanding of brain activity.
Furthermore, our future research endeavors aim to explore subject-independent experiments. Currently, our methodology has been tailored to the characteristics of specific subjects (aa, al, av, aw, and ay). However, extending the applicability of our method to a subject-independent framework represents a critical step toward enhancing its generalizability and real-world utility. Subject-independent experiments would involve training the model on data from one set of subjects and evaluating its performance on completely unseen subjects. This approach simulates a more realistic scenario, where the model’s effectiveness is tested across a diverse population, reinforcing its potential for practical applications in varied clinical or research settings. As we move forward, collaboration and interdisciplinary engagement become imperative. Collaborating with experts in neurology, psychiatry, and related fields can enrich our understanding of the nuanced EEG patterns associated with different neurological conditions. Such collaborations can guide the refinement and adaptation of our method to address the unique challenges posed by each domain [
38].
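A subject-independent evaluation of this kind is typically implemented as leave-one-subject-out cross-validation: train on all subjects but one, test on the held-out subject, and rotate. The sketch below uses synthetic data with placeholder subject labels and an SVM; the grouping, sizes, and classifier are illustrative assumptions, not this study's protocol:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

# Hypothetical pooled trials from five subjects (synthetic labels and groups).
X, y = make_classification(n_samples=250, n_features=30, random_state=0)
groups = np.repeat(np.arange(5), 50)  # one subject id per trial

# Train on four subjects, test on the held-out one, for each subject in turn.
scores = cross_val_score(SVC(), X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores)  # one accuracy per held-out subject
```

Splitting by subject rather than by trial prevents data from the same subject leaking between training and test sets, which is exactly the generalization gap a subject-independent framework is meant to expose.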
Hence, our future research endeavors are poised to address the limitations of our current study and elevate the impact of the proposed self-attention feature extraction method. By venturing into unexplored EEG domains, transitioning to multiclass classification, and embracing subject-independent experiments, we aspire to contribute to the broader landscape of neuroscience research. Through these concerted efforts, our aim is to not only advance the scientific understanding of brain activity but also pave the way for practical applications that hold the potential to transform diagnostics and treatments in the realm of neurological and psychiatric disorders.