Article

Decoding Imagined Speech from EEG Data: A Hybrid Deep Learning Approach to Capturing Spatial and Temporal Features

by Yasser F. Alharbi * and Yousef A. Alotaibi
Computer Engineering Department, King Saud University, Riyadh 11451, Saudi Arabia
*
Author to whom correspondence should be addressed.
Life 2024, 14(11), 1501; https://doi.org/10.3390/life14111501
Submission received: 24 October 2024 / Revised: 13 November 2024 / Accepted: 14 November 2024 / Published: 18 November 2024
(This article belongs to the Special Issue New Advances in Neuroimaging and Brain Functions: 2nd Edition)

Abstract

Neuroimaging is revolutionizing our ability to investigate the brain’s structural and functional properties, enabling us to visualize brain activity during diverse mental processes and actions. One of the most widely used neuroimaging techniques is electroencephalography (EEG), which records electrical activity from the brain using electrodes positioned on the scalp. EEG signals capture both spatial (brain region) and temporal (time-based) data. While a high temporal resolution is achievable with EEG, spatial resolution is comparatively limited. Consequently, capturing both spatial and temporal information from EEG data to recognize mental activities remains challenging. In this paper, we represent spatial and temporal information obtained from EEG signals by transforming EEG data into sequential topographic brain maps. We then apply hybrid deep learning models to capture the spatiotemporal features of the EEG topographic images and classify imagined English words. The hybrid framework utilizes a sequential combination of three-dimensional convolutional neural networks (3DCNNs) and recurrent neural networks (RNNs). The experimental results reveal the effectiveness of the proposed approach, achieving an average accuracy of 77.8% in identifying imagined English speech.

1. Introduction

Brain science has emerged as a crucial research field, particularly with recent advancements in neuroimaging methods, including functional magnetic resonance imaging (fMRI) and electroencephalography (EEG). These non-invasive tools enable us to record the neuronal activity of the brain and visualize its anatomy and function during various mental operations. This advancement has improved our comprehension of how the brain processes language, controls emotions, perceives stimuli, focuses attention, forms memories, and engages in decision making [1,2].
Different aspects of brain activity can be measured using non-invasive neuroimaging techniques, which are commonly categorized as either indirect or direct. fMRI is an indirect method that measures brain activity by detecting changes in blood flow and oxygen levels, known as the BOLD (blood-oxygen-level-dependent) signal. It reflects neural activity only indirectly through the hemodynamic response and requires a large, expensive scanner, but its high spatial resolution makes fMRI well suited to detailed brain mapping in cognitive research and clinical diagnosis [2,3].
In contrast, EEG is a widely used direct method that measures electrical activity by detecting voltage fluctuations from synchronized neuron firing. It captures the brain’s electrical activity from the scalp using multiple electrodes, providing real-time recordings of neural activity and offering important insights into brain function at different spatial and temporal resolutions [2,3]. This capability, combined with EEG’s portability and cost-effectiveness, makes it a powerful tool in neuroscience with potential applications in brain–computer interfaces. People can use these technologies to operate external devices, such as computer interfaces and prosthetic limbs [4]. Furthermore, EEG is utilized in clinical settings to diagnose and monitor various brain disorders. However, processing EEG signals presents challenges such as low signal-to-noise ratios, nonlinearity, and individual variability due to factors like age and psychological state [1,4].
In EEG-based recognition tasks, two key technical challenges often arise: the extraction of discriminative features from EEG signals and the development of effective computational models for accurate recognition. Traditionally, EEG classification has relied on hand-crafted features and classical machine learning techniques [1].
EEG features are usually extracted from four main domains: time, frequency, time–frequency, and spatial. Time-domain (temporal) features capture signal values at specific time points or within time windows, while frequency-domain (spectral) features measure signal power within defined frequency bands. Time–frequency features are obtained by analyzing the EEG signal in both the time and frequency domains simultaneously. Spatial features, on the other hand, focus on the spatial aspects of the signal, including the selection of relevant channels for specific tasks [5]. For example, in [6], time–frequency features were used to decode motor actions with a support vector machine (SVM). Another study [7] extracted features from all four domains and combined them with a source localization method to decode imagery intentions using an SVM-based model. While traditional methods have shown moderate success in decoding EEG, they require extensive domain knowledge to extract optimal features and often struggle with task generalization [4].
Recently, deep learning has been recognized as a significant tool in the fields of neuroimaging and brain monitoring/regulation. It has shown substantial potential for enhancing our understanding of brain function and advancing the development of brain–computer interfaces (BCIs) [4]. As a result, the application of deep learning algorithms to EEG-based tasks has been explored by many researchers. DL algorithms are structured into various architectural categories, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hybrid neural network models. Among these architectures, CNNs have become the leading choice for EEG data classification tasks [8]. The use of CNNs for classification tasks involving imagined speech based on EEG signals has been widely investigated. Imagined speech involves mentally simulating speech without the physical movement of articulators or the production of sound [9,10,11].
Researchers have utilized various CNN-based techniques to enable the automatic learning of complex features and the classification of imagined speech from EEG signals. Research efforts in [12,13,14] explored various CNN-based methods for classifying imagined speech using raw EEG data or extracted features from the time domain.
Meanwhile, other studies have used images derived from EEG data as inputs for CNN-based models to recognize imagined speech. The study in [15] represented EEG data as two-dimensional images called scalograms, which combine time and frequency domains. The scalograms were then used as input for a CNN model with the purpose of classifying imagined words. The experimental results demonstrated that the CNN algorithm performed effectively when applied to scalogram images of EEG data.
Furthermore, in our previous work [16], we developed a method to classify imagined words using four CNN architectures. Each imagined word was represented by a single image consisting of sixteen EEG topographic brain maps, created from the time domain, to incorporate both spatial and temporal information from the EEG data. The study achieved effective classification accuracy, suggesting that topographic maps are a reliable method for distinguishing between different imagined speech patterns in EEG signals.
Although CNN-based models have demonstrated considerable success in classifying imagined speech from EEG signals, they are limited to capturing only local spatial features or short-term temporal patterns [17]. Simultaneously analyzing both the temporal and spatial aspects of EEG data is crucial, as it would provide a wealth of information concerning various brain states [18,19,20]. In this paper, we present a novel framework that employs a hybrid deep learning model based on three-dimensional (3D) CNNs and recurrent neural networks (RNNs) to effectively capture spatial and long-term dependencies in EEG data and subsequently classify imagined words. Our approach uses a sequence of topographic brain maps as input for the hybrid model, enabling the classification of distinct imagined words from EEG data.

2. Materials and Methods

The general flowchart representing the steps conducted in this study for imagined word classification using EEG signals is shown in Figure 1. The process begins with EEG Data Acquisition, where EEG signals of imagined speech are obtained from a public dataset. Next, in the Generating Topographic Maps step, the EEG signals are processed to create topographic brain images at regular intervals. These images are then Normalized and divided into training and testing datasets. In the Hybrid Model Training and Testing stage, a combination of 3D CNN and RNN architectures is created for model training and testing. Finally, in Performance Evaluation, accuracy metrics are calculated to evaluate the model’s effectiveness and reliability across different subjects.
The framework proposed for classifying imagined speech based on EEG data is divided into three main components, as shown in Figure 2. It begins with raw EEG signals recorded from multiple electrodes while a person imagines different words. In the second step, the signals are transformed into a sequence of topographic maps. Next, the topographic maps are processed by a hybrid deep learning model that combines a 3DCNN with different RNN architectures to capture spatial and temporal features. Finally, the output from each model is used to classify the imagined word. Each of these three steps is detailed below.

2.1. EEG Data Acquisition

The proposed method was evaluated using the publicly available BCI2020 dataset for imagined speech [21]. EEG data were collected from 15 participants using a BrainAmp device (Brain Products GmbH, Gilching, Germany) with a sampling rate of 256 Hz and 64 electrodes. All 64 channels correspond to the international 10–20 system. Each participant completed 70 trials, imagining five words (“Hello”, “Help me”, “Stop”, “Thank you”, and “Yes”). The participants were instructed to silently imagine pronouncing the given word without moving their articulators or producing any sound. All the participants were healthy and right-handed [14].

2.2. Transforming EEG Signals into Topographic Brain Maps

Topographic brain mapping is a common approach for spatially analyzing neural activity [22]. It visualizes EEG signals as images, offering insights into the brain’s structural and functional connectivity. Topographic maps can be generated from raw EEG signals or from features extracted in the time or frequency domain. The technique converts EEG data from a one-dimensional (1D) time series into either a two-dimensional (2D) or three-dimensional (3D) image, capturing both the spatial and temporal aspects of the EEG signal, by interpolating the EEG signals from the electrode locations onto a grid of points to create a colour-coded scalp map [12,23,24].
These topographic maps typically remain stable for about 60 to 120 ms before transitioning to different stable configurations. This stability period reflects the transient state of the brain’s overall neuronal activity [25,26].
In this study, we focused on EEG data from fifteen electrodes positioned at different regions on the frontal lobe of the brain (Fp1, AF3, Fp2, AF4, AF7, AF8, F1, Fz, F2, F7, F5, F3, F4, F6, and F8). We specifically selected electrodes covering the left and right frontal lobes due to the significant association between these regions and imagined speech, as previously demonstrated in our study [16] and in others [27,28].
To capture the majority of neuronal activity during each imagined speech trial in the BCI2020 dataset (a 2 s window), we chose 125 ms as the stable time duration, based on evidence suggesting that EEG topographic maps typically remain consistent for 60 to 120 ms [25,26]. This decision strikes a balance between maintaining a high temporal resolution and optimizing data management, reducing computational complexity while still capturing critical brain activity patterns.
In our experiment, during each imagined speech interval (2 s), the raw EEG signals from the fifteen electrodes for each trial were converted into two-dimensional topographic maps every 125 ms, with a slight 4 ms offset. This method generates 16 topographic maps per imagined speech task, providing a key representation of the brain’s dynamic responses throughout the task (see Figure 3).
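Although the paper does not include code for this step, it can be sketched with standard tooling. The snippet below is a minimal illustration using MNE-Python, not the authors’ pipeline: it interpolates the fifteen frontal channels onto a scalp grid and saves one roughly 64 × 64 pixel map per 125 ms window; representing each window by its mean amplitude per channel, the helper name make_topomap_sequence, and the omission of the 4 ms offset are simplifying assumptions.
```python
# Illustrative sketch (not the authors' code): convert one 2 s imagined-speech
# trial (15 frontal channels at 256 Hz) into 16 topographic maps, one per 125 ms.
import numpy as np
import mne
import matplotlib.pyplot as plt

CHANNELS = ["Fp1", "AF3", "Fp2", "AF4", "AF7", "AF8", "F1", "Fz", "F2",
            "F7", "F5", "F3", "F4", "F6", "F8"]
SFREQ = 256                      # sampling rate of the BCI2020 recordings (Hz)
WIN = int(0.125 * SFREQ)         # 125 ms window -> 32 samples per map

def make_topomap_sequence(trial: np.ndarray, out_prefix: str = "map") -> None:
    """trial: array of shape (15, 512), i.e. 15 channels x 2 s of EEG."""
    info = mne.create_info(CHANNELS, SFREQ, ch_types="eeg")
    info.set_montage("standard_1020")              # 10-20 electrode positions
    n_maps = trial.shape[1] // WIN                 # 512 // 32 = 16 maps
    for k in range(n_maps):
        # One value per channel for this window (mean amplitude; an assumption).
        window = trial[:, k * WIN:(k + 1) * WIN].mean(axis=1)
        fig, ax = plt.subplots(figsize=(1, 1), dpi=64)   # ~64 x 64 pixel image
        mne.viz.plot_topomap(window, info, axes=ax, show=False)
        fig.savefig(f"{out_prefix}_{k:02d}.png")
        plt.close(fig)
```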
The topographic maps generated for each experiment were divided into two segments: the first (80%) was used to train the deep learning model and the second (20%) to test the trained model, as detailed in Table 1. Prior to this division, the topographic images were normalized by scaling the pixel values to a range between 0 and 1, dividing each pixel value by 255. This normalization step ensures consistent input data for model training and testing.
We chose a training–testing split evaluation method over k-fold cross-validation because the dataset size was small, and training deep neural networks proved highly resource-intensive, due to the significant computational demands and large number of parameters involved [30].
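A minimal sketch of the normalization and hold-out split described above, assuming the map sequences have already been stacked into a NumPy array; the dummy shapes follow Table 1 (140 trials for a word-pair task), while the use of scikit-learn’s train_test_split, the stratification, and the fixed random seed are illustrative choices.
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 140 word-pair trials, each a sequence of 16 topographic maps
# of 64 x 64 pixels with 3 color channels (shapes as described in the text).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(140, 16, 64, 64, 3), dtype=np.uint8)
y = rng.integers(0, 2, size=140)

X = X.astype("float32") / 255.0                         # scale pixels to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)   # 80%/20% hold-out split
```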

2.3. Deep Learning Model

EEG signal processing pipelines are critical in effectively analyzing and interpreting EEG data for various applications. These pipelines can be categorized into two main types: (1) hybrid and (2) end-to-end. (1) Hybrid processing pipelines incorporate data preprocessing techniques, typically involving digital signal processing (DSP) methods, to eliminate artifacts and/or extract features before the EEG data are fed into machine learning models. (2) End-to-end processing pipelines employ only deep learning models, directly inputting raw EEG data without any preprocessing. Deep learning (DL) can automatically learn complex high-level and latent features from raw EEG signals through its deep architecture, effectively eliminating the need for time-intensive preprocessing and feature extraction steps [23,31,32].
In this study, we introduced a hybrid deep learning model combining 3D CNN and RNN techniques to classify imagined speech, based on topographic brain images from EEG data.

2.3.1. Three-Dimensional Convolutional Neural Network (3D-CNN)

The convolutional neural network (CNN), also known as ConvNet, has become widely used as a powerful deep learning architecture and is particularly effective for handling high-dimensional data types, including images, videos, and EEG signals [33].
In CNNs, several types of convolutions are used to process different data types. Standard (2D) convolution is the most common and captures spatial features like edges and textures in 2D image data. Alternatively, 1D convolution operates across a single dimension and is appropriate for sequential data like time-series data, audio, or text. A further option is 3D convolution, which extends this concept into three dimensions, allowing it to capture both spatial and temporal features for classification tasks for videos, stacks of images, and 3D medical imaging [34].
As shown in Figure 4, the 3D CNN architecture is capable of analyzing the position of objects over time using a 3D activation map throughout the convolution process, which makes it valuable for both data interpretation and capturing temporal context. The 3D convolution process uses a filter that shifts along three axes (x, y, z) to compute low-level feature representations. Each feature map location’s value can be computed using Equation (1). The resulting output forms a three-dimensional volume.
$$v_{ij}^{xyz} = \tanh\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \tag{1}$$
where $v_{ij}^{xyz}$ is the output of the $j$-th feature map in the $i$-th layer at position $(x, y, z)$, $w_{ijm}^{pqr}$ is the kernel value connected to the $m$-th feature map of the preceding layer, $v_{(i-1)m}^{(x+p)(y+q)(z+r)}$ is the corresponding value from the previous layer’s feature map, $b_{ij}$ is the bias term, $m$ indexes the feature maps of the previous layer, and $P_i$, $Q_i$, and $R_i$ denote the kernel size in the x, y, and z directions, respectively.
The resulting feature maps are connected across adjacent frames so that temporal (motion) information is captured; however, a single convolution kernel can extract only one type of feature. The overall architecture of the network resembles that of a 2D convolutional neural network, and, as with 2D convolution, better results can typically be achieved by stacking several convolutional layers. The effectiveness of a 3D CNN depends on both the number of layers and the number and size of the filters within each layer. Although designed for 3D data, 3D convolutions can also be applied to stacks of 2D inputs such as images [35].
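As a quick, illustrative shape check (not taken from the paper), the following TensorFlow snippet applies one 3 × 3 × 3 convolution with ‘same’ padding and one 2 × 2 × 2 max-pooling step to an input shaped like the topographic-map sequences used later (16 × 64 × 64 × 3):
```python
import tensorflow as tf

x = tf.random.normal((1, 16, 64, 64, 3))    # (batch, depth, height, width, channels)
conv = tf.keras.layers.Conv3D(16, (3, 3, 3), padding="same", activation="relu")
pool = tf.keras.layers.MaxPooling3D((2, 2, 2))
print(conv(x).shape)        # (1, 16, 64, 64, 16): 'same' padding preserves the volume
print(pool(conv(x)).shape)  # (1, 8, 32, 32, 16): pooling halves each spatiotemporal axis
```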

2.3.2. Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a powerful type of neural network for managing sequence dependencies for text, audio, and video. A key characteristic of RNN architecture is its cyclic connections, enabling it to update its current state by considering both the present input data and prior states.
A popular variant of RNN is the Long Short-Term Memory (LSTM) network, introduced to address long-term dependencies in sequential data. The LSTM cell improves the memory capabilities of the traditional recurrent cell by implementing a “gate” mechanism within the cell [36]. As Figure 5a illustrates, LSTM consists of multiple cells, each corresponding to a specific time step. Within each LSTM cell, several gates serve the purpose of learning about different aspects of the input time series, helping the network manage and retain relevant information [37]. The operation of the LSTM block is governed by three gates: the input gate, the forget gate, and the output gate. Each of these gates determines the operations performed by the LSTM block when processing incoming inputs. At each time step, both the memory state and the output state are refreshed according to the following equations:
$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \tag{2}$$
$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \tag{4}$$
$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
where $x_t$ is the input at time $t$, $h_t$ is the output (hidden) state, and $c_t$ is the memory (cell) state; the gate and state vectors all share the same dimensionality. The input, forget, and output gates are denoted by $i_t$, $f_t$, and $o_t$, respectively, $W$ and $b$ are the corresponding weight matrices and bias vectors, $\odot$ denotes element-wise multiplication, and $\sigma$ is the nonlinear sigmoid function that determines the gate activations.
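To make the gate computations concrete, the following plain-NumPy function evaluates Equations (2)–(6) for a single time step. It is an illustrative sketch with the weights acting on the concatenated vector [h_{t-1}, x_t], not code used in this study.
```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b hold gate parameters keyed by 'i', 'f', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])                       # input gate, Eq. (2)
    f_t = sigmoid(W["f"] @ z + b["f"])                       # forget gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])  # memory state, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])                       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                                 # hidden state, Eq. (6)
    return h_t, c_t

# Tiny demo with random parameters: hidden size 4, input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.standard_normal((H, H + D)) for k in "ifco"}
b = {k: np.zeros(H) for k in "ifco"}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
```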
The extension of the original LSTM is the stacked LSTM, which features multiple hidden LSTM layers, each containing several memory cells. As shown in Figure 5b, the stacked LSTM architecture comprises several LSTM layers arranged in a stack. Each layer analyzes the output from the preceding layer, enabling the network to understand more complex representations of the input series [37].
Unlike traditional RNNs that process input sequences unidirectionally (either forward or backward), bidirectional LSTM (BiLSTM) networks are designed to capture information from both directions. This gives them an enhanced capacity to understand context and dependencies within the data, making them particularly useful for tasks where context from both directions is essential. As illustrated in Figure 6, the BiLSTM architecture comprises two distinct LSTM layers: a forward LSTM layer, which processes the input sequence from beginning to end, and a backward LSTM layer, which processes it in reverse [37]. Operating in both directions helps the network detect bidirectional long-term interdependencies between temporal phases, so features from both past and future time steps contribute to the output [33].

2.3.3. Proposed Hybrid Deep Learning Architectures

As mentioned earlier, EEG signal analysis faces challenges such as low signal-to-noise ratios, nonlinearity, and individual variability. However, DL models have the potential to capture whole-brain dynamic information and leverage time-varying functional connectivity profiles, offering a promising avenue for advancing our understanding of brain function and disorders [38]. Our proposed architectures aim to extract both temporal and spatial features from sequences of topographic brain maps generated from raw EEG data to differentiate between various imagined words. We present hybrid models that integrate a 3DCNN for capturing spatial features from each topographic image and an RNN for extracting temporal patterns across the sequence of topographic images. In this study, we propose the following three hybrid models (a consolidated code sketch of all three is given after the list):
1. 3DCNN-LSTM model:
The structure of the proposed 3DCNN-LSTM model is presented in Figure 7. It begins by processing the input data, comprising 16 topographic maps, each being a 64 × 64 image with 3 color channels. The model first employs a 3D convolutional neural network (CNN), designed for spatial feature extraction. The first convolutional layer applies 16 filters with a 3 × 3 × 3 kernel, using ReLU activation and ‘same’ padding to preserve the spatial dimensions. This is followed by a 3D max pooling layer, with a 2 × 2 × 2 pool size intended to reduce spatial dimensions and computational complexity. The second convolutional layer increases the filters to 32, again using a 3 × 3 × 3 kernel with ReLU activation and ‘same’ padding, followed by another 2 × 2 × 2 max pooling layer. A third convolutional layer of 64 filters continues this pattern, and a final max pooling layer is then applied to improve the extraction of essential spatial characteristics.
After the 3D CNN layers, the output is flattened into a 1D vector in preparation for the next phase. The flattened data are then reshaped into a 2D sequence with 16 timesteps, making them suitable for processing by a long short-term memory (LSTM) network, which captures temporal dependencies. The LSTM layer has 64 units and, because return_sequences is set to False, outputs a single vector after processing the full sequence. To mitigate overfitting, a dropout layer with a rate of 40% is applied after the LSTM.
Finally, the output layer, a fully connected (dense) layer with softmax activation, generates a probability distribution over the classes, enabling the model to perform binary or multi-class classification. This architecture combines the strengths of the 3D CNN for spatial feature extraction and the LSTM for temporal pattern recognition, making it suitable for imagined word classification using topographic brain maps derived from EEG data.
2. 3DCNN-StackLSTM model:
In this model, we replace the standard LSTM shown in Figure 7 with two stacked LSTM layers. The stacked LSTM layers in the 3DCNN-StackLSTM model are essential for capturing the complex temporal patterns and dependencies in the sequential data derived from the topographic brain maps. The first LSTM layer is configured with 64 units and is set to return sequences, allowing it to output a hidden state for each input timestep. This enables the subsequent LSTM layer to process the entire sequence of outputs, so that each unit learns temporal relationships within the data while managing information flow through its gating mechanisms. The second LSTM layer, also consisting of 64 units, is configured not to return sequences, thereby compressing the output of the first layer into a single hidden state representing the entire input sequence. This design is beneficial for classification tasks and for capturing higher-level temporal features and interactions.
3. 3DCNN-BiLSTM model:
In this model, we replace the standard LSTM (shown in Figure 7) with a BiLSTM layer. The BiLSTM layer is a crucial component for temporal feature extraction after the spatial features have been captured by the preceding 3D CNN layers. This layer is configured with 64 units and is designed to process the reshaped 3D output of the CNN, which has been adjusted to fit the BiLSTM’s expected input format in terms of timesteps and features.
By processing the input sequence of 16 timesteps in both forward and backward directions, the BiLSTM effectively captures contextual information taken from both past and future images of the topographic sequence. This dual perspective allows the model to leverage the temporal dynamics of the data more comprehensively, making it especially powerful for classifying sequences of topographic brain images. Since the return_sequences parameter is set to False, the BiLSTM outputs a single vector to encapsulate the temporal context of the entire input sequence, which is then passed to the dropout layer for regularization prior to being fed into the output layer for classification. This design enhances the capacity of the model to recognize complex patterns over time.
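The three architectures described above can be summarized in the following Keras sketch. This is a reconstruction from the text rather than released code: the layer sizes, kernel and pool sizes, padding, dropout rate, and the 16-timestep reshape follow the description (the 512-feature width follows from the 2 × 8 × 8 × 64 volume left after three poolings), while the function and argument names are illustrative.
```python
# Keras sketch of the three hybrid architectures (a reconstruction from the
# description above, not the authors' released code).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hybrid_model(num_classes: int, rnn: str = "lstm") -> tf.keras.Model:
    inputs = layers.Input(shape=(16, 64, 64, 3))   # 16 maps of 64 x 64 x 3
    x = layers.Conv3D(16, (3, 3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling3D((2, 2, 2))(x)          # -> (8, 32, 32, 16)
    x = layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling3D((2, 2, 2))(x)          # -> (4, 16, 16, 32)
    x = layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling3D((2, 2, 2))(x)          # -> (2, 8, 8, 64)
    x = layers.Flatten()(x)                        # 2 * 8 * 8 * 64 = 8192 values
    x = layers.Reshape((16, 512))(x)               # 16 timesteps x 512 features

    if rnn == "lstm":              # 3DCNN-LSTM
        x = layers.LSTM(64, return_sequences=False)(x)
    elif rnn == "stack_lstm":      # 3DCNN-StackLSTM
        x = layers.LSTM(64, return_sequences=True)(x)    # per-timestep hidden states
        x = layers.LSTM(64, return_sequences=False)(x)   # compressed to a single state
    elif rnn == "bilstm":          # 3DCNN-BiLSTM
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=False))(x)
    else:
        raise ValueError(f"unknown rnn variant: {rnn}")

    x = layers.Dropout(0.4)(x)                     # 40% dropout for regularization
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```
For example, build_hybrid_model(2, "bilstm") would instantiate the 3DCNN-BiLSTM variant for a word-pair task.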

3. Results

We implemented our proposed method using Python 3.10 and TensorFlow 2.15. All training and evaluation experiments were performed on an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The Adam optimizer was employed for training the models, with a learning rate of 0.0001 and a batch size of 32 used consistently across all experiments.
We formulated the categorization of imagined words as a supervised classification problem: the input was the sequence of sixteen EEG topographic images representing the imagined speech, and the target output was the imagined word.
The performance of each model was evaluated using a subject-dependent approach, in which each participant’s data were used individually for both training and testing. Average accuracy was used to assess the effectiveness of the three hybrid deep learning models, as it is the most commonly reported metric in EEG-based imagined speech research [8,39]. Each model’s accuracy was measured for each subject using Equation (7), and the overall average accuracy across all subjects was then computed using Equation (8).
$$\mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{7}$$
$$\text{Average Accuracy} = \frac{1}{N}\sum_{s=1}^{N} \mathrm{Accuracy}_s \tag{8}$$
where N indicates the number of subjects.
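The subject-dependent protocol can be sketched as the loop below, which reuses the build_hybrid_model helper from Section 2.3.3. The optimizer, learning rate, and batch size follow the settings stated above; the loader load_subject_split, the epoch count, the loss function, and the choice of the BiLSTM variant are assumptions made only for illustration.
```python
import numpy as np
import tensorflow as tf

accuracies = []
for subject in range(1, 16):                                     # 15 participants
    (X_tr, y_tr), (X_te, y_te) = load_subject_split(subject)     # hypothetical data loader
    model = build_hybrid_model(num_classes=len(np.unique(y_tr)), rnn="bilstm")
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",        # assumes integer labels
                  metrics=["accuracy"])
    model.fit(X_tr, y_tr, epochs=100, batch_size=32, verbose=0)  # epoch count not reported; assumed
    y_pred = model.predict(X_te).argmax(axis=1)
    accuracies.append(float((y_pred == y_te).mean()))            # Equation (7)
print("Average accuracy:", np.mean(accuracies))                  # Equation (8)
```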
The experiment ensured equal representation of each class by balancing the participants’ trial counts. Several tasks were conducted using three hybrid deep learning models for word-pair and multi-class classification.

3.1. Word-Pair Classification Results

We categorized our experiments for the word-pair classification task according to word length as follows:
1. Short Imagined Words Classification:
In this experiment, we employed EEG topographic maps for three imagined words from the BCI2020 dataset—Yes, Stop, and Hello—to perform word-pair classification using three hybrid models.
Figure 8 and Table 2 present the average performance of the three model architectures, 3DCNN-LSTM (Model 1), 3DCNN-StackLSTM (Model 2), and 3DCNN-BiLSTM (Model 3), when classifying EEG topographic images for the three word pairs Hello–Yes, Stop–Yes, and Hello–Stop. The 3DCNN-BiLSTM model consistently outperforms the others, achieving the highest average scores across all pairs: 74% for Hello–Yes, 72.2% for Stop–Yes, and 77.8% for Hello–Stop. In contrast, the 3DCNN-StackLSTM model shows the lowest performance. The 3DCNN-LSTM model sits in the middle, showing competitive results, especially for the Hello–Stop pair (77.3%).
In addition, Subject #4 achieved the highest accuracy for the ‘Hello–Yes’ task regardless of the model architecture used, while Subject #8 had the highest classification accuracy for the ‘Hello–Stop’ task, also regardless of the model architecture. However, for the ‘Stop–Yes’ classification task, Subject #3 achieved the highest accuracy using the 3DCNN-StackLSTM model. This variation in performance is likely attributable to individual differences in brain signal characteristics and the ability of each model to extract key features from EEG data for each subject.
2. Long Imagined Phrases Classification:
In this experiment, we used the EEG topographic maps for two long imagined phrases from the BCI2020 dataset—Thank you and Help me—to conduct the word-pair classification with the three hybrid models.
Figure 9 and Table 3 show the average accuracy when classifying the imagined phrases “Help me” and “Thank you” using three different hybrid models. The results indicate that the 3DCNN-BiLSTM model was the most effective, achieving the highest average accuracy of 75.2% when classifying the words “Help me” and “Thank you”. Its design allows for the processing of input sequences from both forward and backward perspectives, which proved advantageous when leveraging contextual information. Similarly, the 3DCNN-LSTM model followed closely, with an average accuracy of 74.8%. This model demonstrated a strong capability to capture both the spatial and temporal features essential for accurate classification. In contrast, the 3DCNN-StackLSTM model exhibited a competitive performance with an average accuracy of 71.6% but was the least effective of the three architectures.
Comparing the results based on the subjects’ performance, it is clear from Table 3 that Subject #12 achieved the highest accuracy with both the 3DCNN-LSTM model (86.43%) and the 3DCNN-BiLSTM model (87.5%), while Subject #7 achieved the highest accuracy with the 3DCNN-StackLSTM model (85.36%). The results of the other subjects showed more variability in each model’s performance, highlighting the diverse nature of EEG data and the need for customized classification strategies for each subject.
3. Long–Short Imagined Words Classification:
In this experiment, we utilized EEG topographic maps of a mixture of imagined words from categories 1 and 2 to perform imagined word-pair classification using three hybrid models.
Table 4, Table 5, and Figure 10 present the performance of the three hybrid models across two sets of imagined word pairs: the “Thank you” set (Thank you–Hello, Thank you–Stop, and Thank you–Yes) and the “Help me” set (Help me–Hello, Help me–Stop, and Help me–Yes).
The performance results across the two sets of word pairs show that the 3DCNN-BiLSTM model delivers the highest average performance for most of the word pairs, achieving averages of 76.7% for the ‘Thank you–Stop’ pair and 77.0% for the ‘Help me–Yes’ pair. The 3DCNN-LSTM model follows closely behind, providing a solid middle-ground performance that is often close to that of the 3DCNN-BiLSTM and slightly exceeds it for the ‘Thank you–Hello’ and ‘Help me–Hello’ pairs.
On the other hand, 3DCNN-StackLSTM consistently exhibits the lowest average scores, falling behind the other two models across all word pairs.
In the overall results of the word-pair classification tasks, the 3DCNN-BiLSTM model showed the best performance in most of the tasks, while the 3DCNN-StackLSTM model generally underperformed compared with the other two models. Although all the models demonstrated some effectiveness in classifying imagined words, performance varied significantly across subjects. This inconsistency highlights the influence of individual differences and the unique characteristics of EEG data on classification accuracy. Thus, the results emphasize the need for careful model selection based on each subject’s specific data.

3.2. Multi-Word Classification

In this experiment, we utilized EEG topographic maps of multiple imagined words to perform a multiword classification task using three hybrid models.
Table 6 presents the experimental results of the three models (3DCNN-LSTM, 3DCNN-StackLSTM, and 3DCNN-BiLSTM) across all subjects when classifying the five imagined words in the BCI2020 dataset, based on the sequences of EEG topographic maps. The results indicate that the 3DCNN-StackLSTM model performs best overall, with an average accuracy of 44.7%, outperforming both the 3DCNN-LSTM (40.7%) and the 3DCNN-BiLSTM (42.2%). Moreover, individual performance varied depending on the model used: the 3DCNN-LSTM model achieved its highest accuracy for Subject #7 (50.14%), whereas for the other subjects one of the other two models performed better.
In addition, we investigated the performance of the three model architectures (3DCNN-LSTM, 3DCNN-StackLSTM, and 3DCNN-BiLSTM) in the three-class classification task for the short imagined words (Hello, Yes, and Stop).
Table 7 presents the average accuracy of the three models when classifying the three words across fifteen subjects. Among the models, the 3D CNN-Stack LSTM achieved the highest average accuracy of 59.5%, capturing the complex temporal dependencies within the data. The 3DCNN-BiLSTM followed closely, with an average accuracy of 59.2%, highlighting the impact of bidirectional processing when understanding contexts from both past and future time steps. In contrast, the 3DCNN-LSTM model recorded a lower average accuracy of 57.9%.
Furthermore, individual subject performance varied, with subject #4 yielding the highest accuracy across all the models; notably, in the 3DCNN-BiLSTM model, accuracy peaked at 70.95%. In contrast, Subject #5 showed the lowest accuracy across almost all the models. This indicates that the data from certain subjects may present greater challenges for the models to handle effectively.
In the overall results of the multi-word classification tasks, the 3DCNN-StackLSTM model demonstrated good performance, and the 3DCNN-BiLSTM model also performed well, especially for some subjects. The models again exhibited diverse performance across subjects, indicating that the distinct characteristics of EEG data and individual differences created unique challenges for each model, especially in the multi-word classification task, where the difficulty increased.

4. Discussion

The results of our experiments demonstrate that hybrid deep learning models can effectively classify imagined words from EEG topographic maps by capturing essential temporal and spatial features in EEG signals, even in the presence of typical challenges like low signal-to-noise ratios and individual variability. This approach shows promise for improving the accurate decoding of cognitive states based on neural data. Theoretically, these results provide valuable insights into the neural mechanisms underlying language processing, suggesting that hybrid models can capture complex neural dynamics associated with the generation and representation of imagined words. This advancement extends our understanding of functional brain connectivity in cognitive tasks and holds promise for several applications.
Our study’s results in imagined word-pair classification demonstrate that the 3DCNN-BiLSTM model significantly outperformed other architectures across nearly all word pairs, underscoring its particular suitability for imagined speech decoding tasks. This result is grounded in the model’s unique ability to process EEG sequences bidirectionally through the BiLSTM, enabling it to capture both the preceding and succeeding context in the data. Such context is crucial in imagined speech, where neural signals are inherently nuanced and context-dependent.
Furthermore, our multi-word classification experiments indicate that the 3DCNN-StackLSTM model emerged as the top performer for multi-class classification tasks. The deeper architecture of the StackLSTM allows it to capture complex temporal dependencies and develop richer sequential representations, which are crucial for distinguishing between multiple imagined words.
Practically, the 3DCNN-BiLSTM model’s enhanced capacity to interpret EEG signals for both short and long imagined word pairs suggests that it is a robust choice for several applications where linguistic variability in length and structure is common. Furthermore, its effectiveness in processing diverse imagined word types could enhance both medical and non-medical applications that rely on the precise decoding of imagined speech.
Theoretically, these findings underscore the advantage of using complex architectures for EEG data that involve intricate temporal dependencies. Although simpler models can be competitive in some cases, they often lack the depth required to fully capture these dependencies, particularly in context-sensitive tasks like imagined speech.
To further illustrate the comparative performance of the models, Table 8 sets out the training times for each model for the different classification tasks. The results show that training times increase with the complexity of the classification tasks. Among the architectures, 3DCNN-StackLSTM required the longest training time. This was due to the additional layers in the stacked LSTM, which introduce more parameters and complexity. In comparison, 3DCNN-LSTM and 3DCNN-BiLSTM required shorter training times, with the bidirectional LSTM needing more than the unidirectional counterpart due to the bi-directional data processing.
Although the 3DCNN-StackLSTM had the highest computational cost, it might offer improved performance for more complex multi-classification tasks by capturing richer temporal dependencies. Meanwhile, 3DCNN-BiLSTM balances the trade-off between model complexity and training time, affording a slight increase in time for potentially enhanced feature extraction. The 3DCNN-LSTM, despite being the fastest, may be better suited to simpler tasks where computational efficiency is a priority.
According to [1], the brain consists of numerous neurons whose activity generates distinct electrical potentials on the scalp that vary according to the level of alertness, responses to external stimuli, and other individual-specific factors. The results reveal that the individual differences among subjects significantly influenced the classification outcomes. This variability highlights the importance of accounting for individual characteristics when designing and evaluating EEG-based models, as each subject’s unique patterns and signal characteristics can pose challenges that impact the models’ ability to generalize effectively across different individuals.
Comparing our findings with recent studies on imagined word recognition using EEG data is difficult due to several factors, including the differences in data acquisition protocols, participant numbers, the variety and type of imagined speech words, and the classification algorithms used. Nevertheless, Table 9 provides a comparison of average accuracies derived from several recent works.
The table indicates that the majority of the utilized datasets are private, with the exceptions being the KaraOne database (KaraOne DB), the Pressel Coretto database (Pressel. DB), and the BCI Competition database (BCI DB), which are publicly available. Additionally, the number of subjects in each dataset varied significantly, ranging from as few as one subject to as many as fifteen (our work). The intervals for the imagined word trials also differed greatly, ranging from 1 s to 5 s. It is well known that longer durations of repeated word imagination can enhance model accuracy, although this renders the system less feasible as a tool for real-world use [10,39]. Taking these factors into account, our study represents a significant improvement in performance, as measured by average accuracy, in comparison to other leading methods.

5. Conclusions

This study proposes a hybrid deep learning framework for classifying imagined speech from EEG signals by extracting spatial and temporal features from topographic brain maps. Three hybrid deep learning models were applied and evaluated using the average accuracy metric.
Our findings highlight the effectiveness of combining 3DCNN and RNN models for EEG-based imagined speech classification. The 3DCNN-BiLSTM model achieved the highest average accuracy of 77.8% in word-pair tasks, while the 3DCNN-StackLSTM performed best in multi-class classification. While the models showed promising results, the variability between subjects remains a significant challenge. These findings suggest practical applications for brain–computer interface development and emphasize the theoretical need for personalized models that can adapt to individual EEG characteristics, enhancing the reliability and accuracy of imagined speech decoding.
This study presents several limitations. The limited dataset size may restrict the model’s ability to generalize across diverse EEG patterns, impacting performance consistency. Additionally, using only fifteen electrodes from the frontal lobe could reduce spatial information capture, potentially overlooking regions of the brain that may also be involved in imagined speech processing. Subject variability also presents a challenge, as individual differences in EEG patterns can affect model performance. Future work could benefit from expanded electrode coverage, larger datasets, and tailored approaches to account for individual EEG characteristics to improve generalizability.

Author Contributions

Conceptualization, Y.F.A. and Y.A.A.; methodology, Y.F.A.; software, Y.F.A.; validation, Y.F.A. and Y.A.A.; formal analysis, Y.F.A. and Y.A.A.; investigation, Y.F.A. and Y.A.A.; resources, Y.F.A. and Y.A.A.; data curation, Y.F.A.; writing—original draft preparation, Y.F.A.; writing—review and editing, Y.A.A.; visualization, Y.F.A.; supervision, Y.A.A.; project administration, Y.A.A.; funding acquisition, Y.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Researchers Supporting Project number (RSP2024R322), King Saud University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is available at: https://osf.io/pq7vb (accessed on 30 September 2024).

Acknowledgments

This work was supported by the Researchers Supporting Project number (RSP2024R322), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chaddad, A.; Wu, Y.; Kateb, R.; Bouridane, A. Electroencephalography signal processing: A comprehensive review and analysis of methods and techniques. Sensors 2023, 23, 6434. [Google Scholar] [CrossRef] [PubMed]
  2. Litwińczuk, M.C.; Trujillo-Barreto, N.; Muhlert, N.; Cloutman, L.; Woollams, A. Relating cognition to both brain structure and function: A systematic review of methods. Brain Connect. 2023, 13, 120–132. [Google Scholar] [CrossRef] [PubMed]
  3. Yen, C.; Lin, C.-L.; Chiang, M.-C. Exploring the frontiers of neuroimaging: A review of recent advances in understanding brain functioning and disorders. Life 2023, 13, 1472. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, M.; Ouyang, Y.; Yuan, Z. Deep learning aided neuroimaging and brain regulation. Sensors 2023, 23, 4993. [Google Scholar] [CrossRef]
  5. Pawar, D.; Dhage, S. Feature extraction methods for electroencephalography based brain-computer interface: A review. Entropy 2020, 1, 4. [Google Scholar]
  6. Kim, H.; Yoshimura, N.; Koike, Y. Characteristics of kinematic parameters in decoding intended reaching movements using electroencephalography (EEG). Front. Neurosci. 2019, 13, 1148. [Google Scholar] [CrossRef]
  7. Peng, X.; Liu, J.; Huang, Y.; Mao, Y.; Li, D. Classification of lower limb motor imagery based on iterative EEG source localization and feature fusion. Neural Comput. Appl. 2023, 35, 13711–13724. [Google Scholar] [CrossRef]
  8. Shah, U.; Alzubaidi, M.; Mohsen, F.; Abd-Alrazaq, A.; Alam, T.; Househ, M. The Role of Artificial Intelligence in Decoding Speech from EEG Signals: A Scoping Review. Sensors 2022, 22, 6975. [Google Scholar] [CrossRef]
  9. Cooney, C.; Folli, R.; Coyle, D. Neurolinguistics Research Advancing Development of a Direct-Speech Brain-Computer Interface. iScience 2018, 8, 103–125. [Google Scholar] [CrossRef]
  10. Lopez-Bernal, D.; Balderas, D.; Ponce, P.; Molina, A. A State-of-the-Art Review of EEG-Based Imagined Speech Decoding. Front. Hum. Neurosci. 2022, 16, 867281. [Google Scholar] [CrossRef]
  11. Sharon, R.A.; Narayanan, S.S.; Sur, M.; Hema Murthy, A. Neural Speech Decoding during Audition, Imagination and Production. IEEE Access 2020, 8, 149714–149729. [Google Scholar] [CrossRef]
  12. Liang, X.; Liu, Y.; Yu, Y.; Liu, K.; Liu, Y.; Zhou, Z. Convolutional Neural Network with a Topographic Representation Module for EEG-Based Brain—Computer Interfaces. Brain Sci. 2023, 13, 268. [Google Scholar] [CrossRef] [PubMed]
  13. Datta, S.; Holmberg, J.J.; Antonova, E. Electrode Selection and Convolutional Attention Network for Recognition of Silently Spoken Words from EEG Signals. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–8. [Google Scholar]
  14. Jeong, J.-H.; Cho, J.-H.; Lee, Y.-E.; Lee, S.-H.; Shin, G.-H.; Kweon, Y.-S.; Millán, J.d.R.; Müller, K.-R.; Lee, S.-W. 2020 International brain–computer interface competition: A review. Front. Hum. Neurosci. 2022, 16, 898300. [Google Scholar] [CrossRef] [PubMed]
  15. Agarwal, P.; Kumar, S. Electroencephalography-based imagined speech recognition using deep long short-term memory network. ETRI J. 2022, 44, 672–685. [Google Scholar] [CrossRef]
  16. Alharbi, Y.F.; Alotaibi, Y.A. Imagined Speech Recognition and the Role of Brain Areas Based on Topographical Maps of EEG Signal. In Proceedings of the 2024 47th International Conference on Telecommunications and Signal Processing (TSP), Virtual Event, 10–12 July 2024; pp. 274–279. [Google Scholar]
  17. Yang, X.; Jia, Z. Spatial-Temporal Mamba Network for EEG-based Motor Imagery Classification. arXiv 2024, arXiv:2409.09627. [Google Scholar]
  18. Zhang, D.; Yao, L.; Chen, K.; Wang, S.; Chang, X.; Liu, Y. Making sense of spatio-temporal preserving representations for EEG-based human intention recognition. IEEE Trans. Cybern. 2019, 50, 3033–3044. [Google Scholar] [CrossRef]
  19. Buzzelli, M.; Bianco, S.; Napoletano, P. Unified framework for identity and imagined action recognition from eeg patterns. IEEE Trans. Hum.-Mach. Syst. 2023, 53, 529–537. [Google Scholar] [CrossRef]
  20. Avberšek, L.K.; Repovš, G. Deep learning in neuroimaging data analysis: Applications, challenges, and solutions. Front. Neuroimaging 2022, 1, 981642. [Google Scholar] [CrossRef]
  21. BCI Competition Committee. 2020 International BCI Competition. 2022. Available online: https://osf.io/pq7vb (accessed on 30 September 2024).
  22. Vafaei, E.; Nowshiravan Rahatabad, F.; Setarehdan, S.K.; Azadfallah, P. Extracting a novel emotional EEG topographic map based on a stacked autoencoder network. J. Healthc. Eng. 2023, 2023, 9223599. [Google Scholar] [CrossRef]
  23. Altaheri, H.; Muhammad, G.; Alsulaiman, M.; Amin, S.U.; Altuwaijri, G.A.; Abdul, W.; Bencherif, M.A.; Faisal, M. Deep learning techniques for classification of electroencephalogram (EEG) motor imagery (MI) signals: A review. Neural Comput. Appl. 2023, 35, 14681–14722. [Google Scholar] [CrossRef]
  24. Zhao, M.; Zhang, S.; Mao, X.; Sun, L. EEG Topography Amplification Using FastGAN-ASP Method. Electronics 2023, 12, 4944. [Google Scholar] [CrossRef]
  25. Michel, C.M.; Koenig, T. EEG microstates as a tool for studying the temporal dynamics of whole-brain neuronal networks: A review. Neuroimage 2018, 180, 577–593. [Google Scholar] [CrossRef] [PubMed]
  26. Mishra, A.; Englitz, B.; Cohen, M.X. EEG microstates as a continuous phenomenon. Neuroimage 2020, 208, 116454. [Google Scholar] [CrossRef] [PubMed]
  27. Einizade, A.; Mozafari, M.; Jalilpour, S.; Bagheri, S.; Hajipour Sardouie, S. Neural decoding of imagined speech from EEG signals using the fusion of graph signal processing and graph learning techniques. Neurosci. Inform. 2022, 2, 100091. [Google Scholar] [CrossRef]
  28. Hossain, A.; Das, K.; Khan, P.; Kader, M.F. A BCI system for imagined Bengali speech recognition. Mach. Learn. Appl. 2023, 13, 100486. [Google Scholar] [CrossRef]
  29. Glomb, K.; Cabral, J.; Cattani, A.; Mazzoni, A.; Raj, A.; Franceschiello, B. Computational models in electroencephalography. Brain Topogr. 2021, 35, 142–161. [Google Scholar] [CrossRef]
  30. Hafeez, U.U.; Gandhi, A. Empirical Analysis and Modeling of Compute Times of Cnn Operations on Aws Cloud. In Proceedings of the 2020 IEEE International Symposium on Workload Characterization (IISWC), Beijing, China, 27–30 October 2020; pp. 181–192. [Google Scholar]
  31. Nakagome, S.; Craik, A.; Sujatha Ravindran, A.; He, Y.; Cruz-Garza, J.G.; Contreras-Vidal, J.L. Deep Learning Methods for EEG Neural Classification. In Handbook of Neuroengineering; Springer: Cham, Switzerland, 2022; pp. 1–39. [Google Scholar]
  32. Hossain, K.M.; Islam, M.A.; Hossain, S.; Nijholt, A.; Ahad, M.A.R. Status of deep learning for EEG-based brain–computer interface applications. Front. Comput. Neurosci. 2023, 16, 1006763. [Google Scholar] [CrossRef]
  33. Sarker, I.H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420. [Google Scholar] [CrossRef]
  34. Younesi, A.; Ansari, M.; Fazli, M.; Ejlali, A.; Shafique, M.; Henkel, J. A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends. IEEE Access 2024, 12, 41180–41218. [Google Scholar] [CrossRef]
  35. Vrskova, R.; Hudec, R.; Kamencay, P.; Sykora, P. Human activity classification using the 3DCNN architecture. Appl. Sci. 2022, 12, 931. [Google Scholar] [CrossRef]
  36. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
  37. Ghojogh, B.; Ghodsi, A. Recurrent neural networks and long short-term memory networks: Tutorial and survey. arXiv 2023, arXiv:2304.11461. [Google Scholar]
  38. Yan, W.; Qu, G.; Hu, W.; Abrol, A.; Cai, B.; Qiao, C.; Plis, S.M.; Wang, Y.-P.; Sui, J.; Calhoun, V.D. Deep learning in neuroimaging: Promises and challenges. IEEE Signal Process. Mag. 2022, 39, 87–98. [Google Scholar] [CrossRef]
  39. Panachakel, J.T.; Ramakrishnan, A.G. Decoding Covert Speech From EEG-A Comprehensive Review. Front. Neurosci. 2021, 15, 642251. [Google Scholar] [CrossRef]
  40. Agarwal, P.; Kumar, S. Imagined word pairs recognition from non-invasive brain signals using Hilbert transform. Int. J. Syst. Assur. Eng. Manag. 2021, 13, 385–394. [Google Scholar] [CrossRef]
  41. Singh, A.; Gumaste, A. Decoding imagined speech and computer control using brain waves. J. Neurosci. Methods 2021, 358, 109196. [Google Scholar] [CrossRef]
  42. Lee, D.Y.; Lee, M.; Lee, S.W. Decoding Imagined Speech Based on Deep Metric Learning for Intuitive BCI Communication. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 1363–1374. [Google Scholar] [CrossRef]
  43. Bakhshali, M.A.; Khademi, M.; Ebrahimi-Moghadam, A. Investigating the neural correlates of imagined speech: An EEG-based connectivity analysis. Digit. Signal Process. A Rev. J. 2022, 123, 103435. [Google Scholar] [CrossRef]
  44. Varshney, Y.V.; Khan, A. Imagined Speech Classification Using Six Phonetically Distributed Words. Front. Signal Process. 2022, 2, 760643. [Google Scholar] [CrossRef]
  45. Lee, S.-H.; Lee, Y.-E.; Lee, S.-W. Toward Imagined Speech based Smart Communication System: Potential Applications on Metaverse Conditions. arXiv 2022, arXiv:2112.08569. [Google Scholar] [CrossRef]
Figure 1. The general flowchart outlining the steps of our study.
Figure 2. The proposed framework for identifying imagined words using EEG signals.
Figure 3. Transformation of EEG data into topographic maps [29].
Figure 4. 3D convolution procedure [35].
Figure 5. LSTM architectures: (a) simple LSTM; (b) stacked LSTM [37].
Figure 6. BiLSTM architecture.
Figure 7. 3DCNN-LSTM architecture.
Figure 8. Average accuracy of short-word classification using three hybrid models.
Figure 9. Average accuracy of long imagined phrase classification using three models.
Figure 10. Average accuracy of long–short imagined words classification.
Table 1. Data division for model training and testing.

Experiment                  Train        Test        Total
Word-pair classification    112 (80%)    28 (20%)    140
3-class classification      168 (80%)    42 (20%)    210
5-class classification      280 (80%)    70 (20%)    350
Table 2. Average accuracy of binary classification for short imagined words using three models.

Word pair            Hello–Yes               Stop–Yes                Hello–Stop
Subject\Model *      1       2       3       1       2       3       1       2       3
1                    72.5    70      78.57   72.5    59.29   71.43   87.14   75.36   86.79
2                    71.79   64.64   74.29   68.21   68.57   70      79.64   68.57   73.57
3                    65      63.57   66.43   76.79   83.57   82.14   74.64   71.79   76.79
4                    83.93   80.71   82.5    73.93   68.21   67.14   75.36   68.21   75.36
5                    73.93   65      71.07   71.79   66.07   72.86   76.43   55      78.21
6                    72.86   73.93   79.64   68.93   71.79   71.79   77.86   67.14   75.71
7                    66.07   68.57   64.64   77.5    71.07   73.93   74.29   68.21   75.36
8                    79.29   72.5    78.21   82.5    79.29   82.5    87.86   80.71   87.14
9                    68.21   63.57   69.29   71.43   66.43   71.07   71.43   70.36   70.71
10                   75.71   65.36   75.71   61.79   63.93   66.43   68.93   67.5    71.79
11                   76.06   66.79   77.86   66.79   68.93   72.5    77.86   57.14   76.06
12                   67.14   63.57   68.57   62.86   62.5    60.71   68.57   66.79   68.21
13                   65.36   61.79   68.93   70.36   65.36   71.07   80.36   59.64   85.36
14                   80      68.21   80      68.93   69.29   69.64   78.93   70.36   83.93
15                   71.07   64.29   73.93   78.57   72.5    79.29   80.71   68.21   81.43
Average              72.6    67.5    74      71.5    69.1    72.2    77.3    67.7    77.8
* 3DCNN + LSTM (1), 3DCNN + StackLSTM (2), 3DCNN + BiLSTM (3), Max = boldfaced, Min = underlined.
Table 3. Average accuracy of long imagined phrase classification using three models.

Subject\Model    3DCNN + LSTM    3DCNN + StackLSTM    3DCNN + BiLSTM
1                69.64           67.14                68.21
2                64.64           68.93                71.79
3                85.36           77.86                86.79
4                75.71           64.29                73.93
5                68.57           70                   70.71
6                75.36           79.64                77.5
7                82.14           85.36                83.21
8                78.57           75                   77.14
9                69.64           73.93                64.29
10               72.86           68.21                74.64
11               77.14           64.64                75.71
12               86.43           78.21                87.5
13               68.21           63.57                66.43
14               66.07           66.07                68.21
15               81.07           70.71                81.79
Average          74.8            71.6                 75.2
Max = boldfaced, Min = underlined.
Table 4. Average accuracy of binary classification for ‘Thank You’ set words using three models.

Word pair            Thank You–Hello         Thank You–Stop          Thank You–Yes
Subject\Model *      1       2       3       1       2       3       1       2       3
1                    70      68.21   69.64   70.71   64.64   72.14   68.57   65      70
2                    84.67   75.21   82.86   79.29   75.36   77.14   74.64   71.07   70.71
3                    77.5    75.71   78.93   82.14   81.43   83.57   76.79   74.64   82.86
4                    83.21   74.64   90.36   83.21   80.36   86.43   84.29   70      86.07
5                    72.5    66.07   75      80      66.43   76.07   68.21   69.29   75.71
6                    72.86   75.71   70.36   77.14   67.14   78.21   76.07   77.5    75.71
7                    73.57   67.14   73.93   78.57   72.14   77.86   73.21   75      73.21
8                    76.79   73.57   75.71   89.29   80.36   89.29   87.14   79.64   86.79
9                    76.79   62.93   74.64   59.64   64.64   58.57   69.64   67.5    66.43
10                   64.29   65.71   62.5    72.86   69.64   76.79   68.93   67.14   71.43
11                   81.79   70      80.36   75.36   72.14   73.93   77.5    75      76.07
12                   77.86   70.36   76.79   62.5    70.71   67.5    63.57   66.79   67.14
13                   78.93   69.29   77.14   74.29   71.07   73.21   71.79   72.14   75.71
14                   68.57   65.71   62.5    73.57   69.64   77.5    78.21   71.07   76.43
15                   85.36   79.29   84.29   83.57   75.36   82.14   73.21   63.93   77.14
Average              76.3    70.6    75.7    76.1    72.1    76.7    74.1    71      75.4
* 3DCNN + LSTM (1), 3DCNN + StackLSTM (2), 3DCNN + BiLSTM (3), Max = boldfaced, Min = underlined.
Table 5. Average accuracies of binary classification for ‘Help me’ set words using three models.

Word pair            Help Me–Hello           Help Me–Stop            Help Me–Yes
Subject\Model *      1       2       3       1       2       3       1       2       3
1                    78.57   69.64   77.86   77.5    73.93   74.64   68.57   65      77.14
2                    84.64   75.71   81.43   78.21   69.64   79.29   74.64   71.07   76.07
3                    80.36   77.5    82.14   77.14   67.5    82.5    76.79   74.64   83.21
4                    78.57   68.57   76.43   71.79   62.5    73.57   84.29   70      83.57
5                    68.93   68.57   64.64   70.36   63.57   71.79   68.21   69.29   78.57
6                    66.79   68.93   66.07   71.79   65      71.07   76.07   77.5    70.71
7                    64.64   68.57   62.5    78.57   78.93   79.29   73.21   75      84.29
8                    67.86   66.79   68.93   77.14   69.64   80.36   87.14   79.64   84.29
9                    71.07   64.64   71.07   71.43   63.93   68.57   69.64   67.5    71.79
10                   75      72.86   81.73   67.14   70.71   69.29   68.93   67.14   70
11                   69.64   65      67.86   72.86   79.29   71.07   77.5    75      83.57
12                   79.29   76.43   77.86   73.21   66.79   72.5    63.57   66.79   75.36
13                   64.29   62.86   58.57   75      65      73.57   71.79   72.14   59.29
14                   69.94   69.29   65.71   72.5    68.57   71.43   78.21   71.07   78.57
15                   75.36   75.71   83.57   83.57   77.5    84.64   73.21   63.93   78.21
Average              73      70.1    72.4    74.5    69.5    74.9    74.1    71      77
* 3DCNN + LSTM (1), 3DCNN + StackLSTM (2), 3DCNN + BiLSTM (3), Max = boldfaced, Min = underlined.
Table 6. Average accuracy of multi-classification for five imagined words using three models.

Subject\Model    3D CNN + LSTM    3D CNN + StackLSTM    3D CNN + BiLSTM
1                41.57            44.43                 41.29
2                41.14            48                    42.71
3                43.29            48.29                 45.57
4                47.43            49.43                 48.43
5                37.14            41                    39.57
6                41.43            49.43                 42.71
7                50.14            49.71                 49.57
8                43.43            43.57                 43.29
9                40.43            42                    41.71
10               34.14            41.14                 40
11               45.29            48.57                 43.14
12               34.71            41.71                 39.86
13               33.57            36.57                 34.29
14               43.14            47                    43.43
15               33.71            39.71                 38
Average          40.7             44.7                  42.2
Max = boldfaced, Min = underlined.
Table 7. Averaged accuracies for three imagined word classifications using three models.

Subject\Model    3D CNN + LSTM    3D CNN + StackLSTM    3D CNN + BiLSTM
1                50.71            51.9                  50.48
2                62.38            59.05                 61.9
3                53.81            53.33                 53.57
4                67.38            68.1                  70.95
5                48.1             48.81                 47.38
6                59.52            63.81                 61.9
7                65.24            66.43                 66.19
8                63.1             68.57                 68.33
9                54.05            55.24                 55
10               46.9             49.05                 51.43
11               60.95            61.9                  60.48
12               52.14            58.33                 58.57
13               62.38            60.71                 59.52
14               62.62            59.05                 63.1
15               58.57            67.86                 59.52
Average          57.9             59.5                  59.2
Max = boldfaced, Min = underlined.
Table 8. Training times for different models and classification tasks.

Experiment\Model            3D CNN + LSTM    3DCNN + StackLSTM    3D CNN + BiLSTM
Word-pair classification    2.0 h *          2.3 h                2.1 h
3-class classification     3.4 h            4.5 h                3.7 h
5-class classification     4.8 h            7.7 h                6.2 h
* h (hour).
Table 9. Comparison of recent studies on the performance of imagined word decoding.

Paper       Dataset                       Data Type   Model            Length (s)           Word-Pair Acc. %   Multiclass Acc. %
[40]        12 Sub. * & 2 words           Private     Conv-attention   3                    80.00              -
[41]        5 Sub. & 3 words              Private     ANN based        1                    60.35              -
[27]        15 Sub. & 3 words             Private     SVM              3                    -                  50.10 (3-class)
[15]        13 Sub. & 5 words             Private     LSTM             2                    -                  73.56 (5-class)
[42]        Pres. (15 Sub. & 6 words);    Public      k-NN             Pres. (4); BCI (2)   -                  45 Pres. (6-class); 48.10 BCI (5-class)
            BCI (15 Sub. & 5 words)
[43]        Kara. (8 Sub. & 11 words)     Public      SVM              5                    85.3               81.6 (11-class)
[44]        One subject & 6 words         Private     RF & SVM         2                    82.0               16.67 (6-class)
[45]        9 Sub. & 13 words             Private     SVM              1                    75.56              46.54 (13-class)
[16]        BCI (15 Sub. & 5 words)       Public      CNN-based        2                    76                 59.7 (3-class)
Our work    BCI (15 Sub. & 5 words)       Public      3DCNN-RNNs       2                    77.8               59.5 (3-class); 44.7 (5-class)
* Sub. (Subjects), Kara. (KaraOne DB), Pres. (Pressel. DB), BCI (BCI DB).