1. Introduction
Imaginary speech recognition (ISR) systems have grown in popularity in recent years. These systems typically acquire signals from the brain using invasive or non-invasive methods, preprocess them to improve signal quality, extract features in order to concentrate the significant information into a smaller amount of data, and finally classify the features in order to offer feedback to the user. This processing chain must work in real time on portable devices if the final system is to be used in everyday activities by the people who need it. The main purpose of ISR is to come to the aid of persons who have lost the ability to speak after suffering from disorders such as locked-in syndrome, cerebral palsy, etc.
An important part of ISR systems is the signal acquisition stage. There are several methods of collecting signals for a brain-computer interface (BCI) system, such as electroencephalography (EEG), electrocorticography (ECoG), magnetoencephalography (MEG), positron emission tomography (PET), and functional magnetic resonance imaging (fMRI). The most commonly used method is EEG. The main advantage of EEG signals is that the acquisition is non-invasive and low-cost compared to MEG, PET, and fMRI. However, since the signals are acquired from the surface of the scalp, there are multiple layers between the source and the electrode, which leads to a high attenuation of the signal and makes it more sensitive to noise. The second most used acquisition method for imaginary speech decoding is ECoG, due to the higher quality of the collected data, since the signals are acquired directly from the cortex. Nevertheless, this advantage comes with the disadvantage of being an invasive method, which makes data acquisition difficult.
2. State of the Art
In recent years, recognizing silent speech from cortical signals has attracted increasing attention from researchers. Different approaches have been tried over the years, aiming to achieve the best performance.
The first attempts consisted of creating the desired words from letters using different methods of choosing the letters, such as moving a cursor on a monitor [1] or following a matrix of ASCII characters as its lines and columns are highlighted [2]. A successful real-time communication system based on creating words letter-by-letter was developed by Ujwal Chaudhary et al. [3] for a patient with amyotrophic lateral sclerosis (ALS). Due to the degradation of the motor functions, the only way of communication for the patient was through brain signals. The researchers implanted a 64-microelectrode array, which allowed the subject to modulate neural firing rates in response to audio feedback, first selecting a block of letters and then a letter from that block in order to form words. After days of training, on the 245th day after implantation, the patient was able to create complex sentences, such as, “I would like to listen to the album by Tool loud”, at a mean rate of one character per minute.
These methods offered significant results and have the advantage of improving over time as the subjects practice mastering the system. However, this type of communication is difficult, unnatural, and relatively slow, requiring about one character per minute to form a word [3].
Other approaches targeted decoding speech directly from the imagined words or phonemes. These methods assume that, during imaginary speech, the brain exhibits distinct patterns of electrical activity corresponding to the pronunciation of each word.
Trying to find such markers of imaginary speech, Timothée Proix et al. [4] investigated ECoG signals acquired during overt and imagined speech. In the conducted study, the researchers first tried to find similarities between overt and imagined speech. They computed the power spectrum in four frequency bands, theta (4–8 Hz), low-beta (12–18 Hz), low-gamma (25–35 Hz), and broadband high-frequency activity (BHA, 80–150 Hz), and observed that the BHA power for both overt and imagined speech increased in the sensory and motor regions, while beta power decreased over the same regions. Nevertheless, when trying to discriminate syllables in binary classification tasks (articulatory, phonetic, and vocalic), the results for imaginary speech dropped significantly in the BHA band compared to overt speech. Better performance was recorded when using power in the beta band, but with room for improvement. Significant improvements were obtained when the trials were aligned on the recorded speech onset, which was impossible to do for imaginary speech.
Another study based on ECoG signals managed to obtain important results when classifying five different words in a patient-specific system [5]. The researchers used high-gamma time features, which were aligned using dynamic time warping (DTW), and finally obtained a feature matrix by computing the DTW distance between the realigned trials. These features, combined with a support vector machine (SVM), offered encouraging results for a five-class classification task, obtaining a 58% mean accuracy over all five subjects. In 2019, Miguel Angrick et al. [6] managed to synthesize speech from ECoG signals by also using the speech signal acquired together with the ECoG signal. The obtained results registered a correlation between the reconstructed speech signal and the actual speech that was higher than chance for all the analyzed subjects. Moreover, one of the subjects had a correlation of 0.69, which allowed for a good reconstruction of the vocal signal.
Recent studies of stereoelectroencephalography (sEEG) signals for decoding imaginary speech reported the development of a real-time system with speech synthesis for only one patient, due to the difficulty of data acquisition [7]. The researchers processed the neural signals by extracting the multichannel high-gamma band as features, which were decoded by mapping the neural activity to a mel-scaled audio spectrogram. Further, a regularized LDA classifier was trained to predict the energy level for each spectral bin. The results reported an average Pearson correlation of 0.62 ± 0.15 between the reconstructed signal and the real speech signal, but the reconstructed speech was not yet intelligible.
However, even if ECoG and sEEG offer a good representation of the neural activity with a high signal-to-noise ratio, they are still invasive methods of acquiring signals from the brain, which leads to great limitations regarding collecting the signals and being accepted by the patients. The alternatives to these invasive methods are functional magnetic resonance imaging (fMRI), functional near-infrared spectroscopy (fNIRS), magnetoencephalography (MEG), and electroencephalography (EEG). All these methods have their limitations: fMRI is expensive, non-portable, and has a poor temporal resolution; fNIRS takes time to collect the information from the brain (approx. 2 to 5 s), which makes it harder to target a real-time system; and MEG is expensive and non-portable. Finally, EEG remains the best option for a real-time, noninvasive, wearable device, while having the limitation of small-amplitude data with a lower signal-to-noise ratio compared to ECoG.
In recent years, several studies have used EEG signals for imaginary speech recognition. In study [8], the researchers analyzed six different words, “could”, “yard”, “give”, “him”, “there”, and “toe”, acquired from 15 subjects. In the feature extraction stage, the signals were decomposed using the Daubechies-4 (db4) mother wavelet into eight levels corresponding to the delta, theta, alpha, beta, and gamma bands, plus three additional bands, <2 Hz, 64–128 Hz, and 128–256 Hz, and features such as the root mean square, standard deviation, and relative wavelet energy were computed for each sub-band. The features were fed into a random forest (RF) classifier and a support vector machine (SVM). The registered results rose above chance level for both classifiers, with a mean accuracy over all 15 subjects of 25.26% for RF and 28.61% for SVM.
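As an illustration of this type of wavelet-band feature extraction, the sketch below computes the root mean square, standard deviation, and relative wavelet energy per sub-band; PyWavelets, the sampling rate, and the epoch length are assumptions, not details taken from study [8].

```python
# Sketch of wavelet-band feature extraction (RMS, std, relative wavelet energy);
# library choice and signal parameters are assumptions, not those of study [8].
import numpy as np
import pywt

def wavelet_band_features(signal, wavelet="db4", levels=8):
    """Decompose one EEG channel and compute simple statistics per sub-band."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)   # [cA_8, cD_8, ..., cD_1]
    total_energy = sum(np.sum(c ** 2) for c in coeffs)
    features = []
    for c in coeffs:
        rms = np.sqrt(np.mean(c ** 2))                     # root mean square
        std = np.std(c)                                    # standard deviation
        rwe = np.sum(c ** 2) / total_energy                # relative wavelet energy
        features.extend([rms, std, rwe])
    return np.array(features)

# Example: one 5-second epoch sampled at 512 Hz (illustrative values only)
epoch = np.random.randn(5 * 512)
print(wavelet_band_features(epoch).shape)                  # 9 sub-bands x 3 statistics = (27,)
```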
A great breakthrough in the field came along with open-access datasets, such as the Kara One dataset [9] and the dataset of Nguyen et al. [10]. These helped the community to develop research in the imaginary speech field without having to go through the complicated process of data acquisition, and made it possible to compare results with other studies developed in the same environment.
Using the Kara One dataset, Panachakel et al. [11] managed to develop a patient-specific system based on computing a set of statistical features (root mean square, variance, kurtosis, skewness, and the third-order moment) over the signals obtained after decomposition with the db4 mother wavelet into seven levels. Afterwards, the features were fed into a deep learning architecture with two layers of 40 neurons. The results offered a 57.15% mean accuracy over all subjects, significantly higher than the results obtained without using deep learning. With a more complex deep learning architecture, the researchers from the University of British Columbia presented, in their paper [12], a generalized system for imaginary speech binary classification, achieving a best accuracy of 85.23% for the discrimination between consonant and vowel phonemes. The results were achieved using the covariance matrix between the channels as a feature matrix and a combination of convolutional neural networks (CNN), long short-term memory (LSTM) neural networks, and a deep autoencoder (DAE).
In 2021, a larger dataset was collected in Moscow, Russia, containing signals acquired from 270 healthy subjects during the silent speech of eight Russian words: “forward”, “backward”, “up”, “down”, “help”, “take”, “stop”, and “release”. The patient-specific system developed on this dataset offered an accuracy of 84.5% for the multiclass word classification and 87.9% for binary classification, using a ResNet18 deep learning architecture in combination with two layers of gated recurrent units (GRUs). Finally, the researchers claimed that there are strong differences between the signals of different subjects, and that it is more plausible to develop a patient-specific system with a high accuracy than a generalized one.
Deep learning architectures have shown better performance on imaginary speech signals and have gained popularity over the years. Recently, EEG studies started to concentrate on the LSTM neural network, due to the advantages it offers for continuous time series. The LSTM neural network provided significant results in applications such as epilepsy prediction [13] and imaginary speech recognition [14]. The results obtained by the researchers in [14] registered a maximum accuracy of 73.56% for a subject-independent system and a four-class problem: “sos”, “stop”, “medicine”, and “washroom”.
In this paper, we focused on developing a subject’s shared system for imaginary speech recognition. By a subject’s shared system, we mean that one system was developed for all registered users in the database. However, this system differs from a patient-specific system in that, when a new subject is introduced into the database, it only needs a fine-tuning of the neural network and not an entire per-subject training. During our research, we used the Kara One database, which was preprocessed for further usage. The cross-covariance in the frequency domain was computed in the feature extraction stage, and the resulting features were further introduced into a CNNLSTM neural network. We showed that the CNNLSTM performed better than the CNN neural network, increasing the accuracy from 37% to 43%. The system behavior after reducing the number of channels, selecting the channels from the main brain areas and from the areas corresponding to the anatomical structures involved in imaginary speech production, was also tested. We showed that the channels corresponding to the anatomical structures involved in speech production concentrate 93% of the information, obtaining a maximum accuracy of approx. 40%. Even if the accuracy dropped by 3%, much more was gained in terms of execution time, comfort, portability, and costs.
The system was developed using Python as the programming language, and the CNNLSTM neural network was implemented using TensorFlow, an end-to-end open-source platform for machine learning [15].
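A minimal sketch of such a CNNLSTM classifier in tf.keras is given below; the number of windows, the filter counts, kernel size, pooling, and dropout are illustrative assumptions and not the exact architecture of the proposed system.

```python
# Sketch of a 2D CNNLSTM (ConvLSTM2D) classifier over windowed channel x channel
# feature matrices; all layer sizes here are assumptions, not the paper's exact values.
import tensorflow as tf

N_WINDOWS, N_CHANNELS, N_CLASSES = 10, 62, 11   # 7 phonemes + 4 words; window count assumed

def build_cnnlstm(n_windows=N_WINDOWS, n_channels=N_CHANNELS, n_classes=N_CLASSES):
    inputs = tf.keras.Input(shape=(n_windows, n_channels, n_channels, 1))
    x = tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                                   return_sequences=True)(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                                   return_sequences=False)(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnnlstm()
model.summary()
```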
4. Results
This paper aimed to develop an intelligent subject’s shared ISR system for differentiating the seven phonemes and four words of the Kara One database. The signals from the database were preprocessed in order to obtain a better quality of the data and were then introduced into a feature extraction stage, with the purpose of extracting the main information hidden in the EEG signals regarding imaginary speech. The features used were based on the inter-channel covariation in the frequency domain, a method of feature extraction first introduced into the ISR domain in study [17], which has the main advantage of encoding the variability of the electrodes when a stimulus is sent by the brain to the motor neurons involved in the process. Another advantage of the method, first introduced to imaginary speech by Pramit Saha and Sidney Fels in their study [12], is that computing the features in the frequency domain, as opposed to the time domain, eliminates the possible delays of the brain impulse across channels. In the classification stage, we used a 2D CNNLSTM neural network in order to capture both the spatial and the temporal correlations between the different windows and the electrodes.
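The sketch below shows one plausible implementation of this inter-channel cross-covariance feature computed in the frequency domain; the window length, the window step, and the use of magnitude spectra are assumptions, since the exact parameters follow study [17] and are not restated here.

```python
# Sketch of windowed inter-channel cross-covariance computed in the frequency
# domain; window length, step, and spectral magnitude use are assumptions.
import numpy as np

def crosscov_frequency_features(eeg, win_len=256, step=128):
    """eeg: array of shape (n_channels, n_samples) for one imaginary-speech epoch.
    Returns an array of shape (n_windows, n_channels, n_channels)."""
    n_channels, n_samples = eeg.shape
    features = []
    for start in range(0, n_samples - win_len + 1, step):
        window = eeg[:, start:start + win_len]
        spectra = np.abs(np.fft.rfft(window, axis=1))     # per-channel magnitude spectra
        spectra -= spectra.mean(axis=1, keepdims=True)    # remove the per-channel mean
        cov = spectra @ spectra.T / spectra.shape[1]      # channel x channel covariance
        features.append(cov)
    return np.stack(features)

epoch = np.random.randn(62, 5 * 1000)                     # 62 channels, 5 s at 1 kHz (assumed)
X = crosscov_frequency_features(epoch)
print(X.shape)                                            # (n_windows, 62, 62)
```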
During the development of the system, we also studied the system performance for different regions of the brain, frontal (F), central (C), and occipital (O), for different hemispheres, left (L) and right (R), and for combinations of them:
FC = frontal and central;
CO = central and occipital;
FCO = frontal, central and occipital;
Frontal left and right;
Central left and right;
Occipital left and right.
To these regions, we added the ASBA (anatomical speech brain areas) region. Considering the current trend of developing systems that can be easily transposed to a portable device, reducing the number of channels can be a big step towards this aspiration. By using only the electrodes that are directly responsible for speech production, the device is more likely to be accepted by the users. Another advantage of selecting the channels is the reduction of the memory and the execution time of the system.
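As a sketch of this region-based channel selection, the snippet below keeps only the rows and columns of the cross-covariance matrices that correspond to a chosen electrode subset; the electrode lists are illustrative 10-20/10-10 labels, not the exact selections of Table 8.

```python
# Sketch of region-based channel selection; electrode lists are illustrative only,
# not the actual per-region selections reported in Table 8.
import numpy as np

REGIONS = {
    "F":    ["Fp1", "Fp2", "F7", "F3", "Fz", "F4", "F8"],
    "C":    ["T7", "C3", "Cz", "C4", "T8"],
    "O":    ["PO3", "POz", "PO4", "O1", "Oz", "O2"],
    "ASBA": ["F7", "F5", "F3", "FC5", "FC3", "C5", "C3", "C1", "Cz"],  # illustrative
}

def select_channels(features, channel_names, region):
    """features: (..., n_channels, n_channels) cross-covariance matrices."""
    idx = [channel_names.index(ch) for ch in REGIONS[region] if ch in channel_names]
    return features[..., idx, :][..., :, idx]   # keep matching rows and columns
```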
4.1. CNNLSTM vs. CNN
This study focused on highlighting the advantages of the 2D CNNLSTM neural network in recognizing imaginary phonemes and words using the inter-channel cross-covariance method in the feature extraction stage. The results achieved were compared with those reported in study [17], where the same feature extraction method was used with a CNN neural network as the classifier. The comparison of the results can be observed in Table 3.
The CNNLSTM neural network managed to increase the accuracy from 0.3758 to 0.4398, according to Table 3, when the same preprocessing and feature extraction chain was followed. This difference in accuracy is due to the ability of the LSTM to learn the long-term dependencies of time series, in combination with the 2D CNN, which also encodes the spatial correlation between the electrodes.
The mean confusion matrix of the k folds is presented in Figure 5 and reveals that there is hardly any confusion between the phonemes and words, as expected after the visualization of the LDA dimension reduction results in Figure 2.
An analysis of the distances between the means of the classes for every phoneme and word is presented in Table 4 and Table 5, for the cross-covariance features computed in the frequency domain (Table 4) and for the features produced by the CNNLSTM neural network after the first CNNLSTM layer (Table 5). This analysis aims to highlight the role of the CNNLSTM layers in the adaptive feature extraction process, which is designed to transform the input space and to increase the separability between the classes.
To compute the Euclidean distance between the means of the classes, the input matrix was first reshaped into a matrix of size N × N_features, where N is the number of input vectors and N_features is the total number of features, corresponding to N_window × N_channels × N_depth for the cross-covariance matrix in the frequency domain and to N_window × N_channels × N_channels × N_depth for the features computed by the first layer of the CNNLSTM neural network. All features were normalized to the range [0, 1] using Equation (12) before computing the Euclidean distances with Equation (13):
$\hat{x}_{i,j} = \dfrac{x_{i,j} - \min_j}{\max_j - \min_j}$   (12)

where $x_{i,j}$ is the value of feature j of input vector i, $\min_j$ is the minimum value of feature j over all input vectors, and $\max_j$ is the maximum value of feature j over all input vectors.
The Euclidean distance between the mean of utterance u1 and the mean of utterance u2 was computed as:

$d(u_1, u_2) = \sqrt{\sum_{j=1}^{N_{features}} \left(\mu_{u_1,j} - \mu_{u_2,j}\right)^2}$   (13)

where $\mu_u$ is the mean over all input vectors of utterance u, computed as:

$\mu_{u,j} = \dfrac{1}{N_u} \sum_{i=1}^{N_u} \hat{x}^{(u)}_{i,j}$   (14)

with $N_u$ the number of input vectors belonging to utterance u.
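A compact sketch of Equations (12)–(14), under the reconstruction given above, is shown below; the variable names are assumptions.

```python
# Sketch of Equations (12)-(14): per-feature min-max normalization followed by
# the pairwise Euclidean distances between the class means.
import numpy as np

def class_mean_distances(X, y):
    """X: (n_vectors, n_features) flattened features; y: (n_vectors,) utterance labels."""
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min + 1e-12)                    # Equation (12)
    classes = np.unique(y)
    means = np.stack([X_norm[y == c].mean(axis=0) for c in classes])  # Equation (14)
    diff = means[:, None, :] - means[None, :, :]
    return classes, np.sqrt((diff ** 2).sum(axis=-1))                 # Equation (13)
```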
It can be seen from Table 4 and Table 5 that the Euclidean distances between the means increased significantly after the first convolutional layer. This growth highlights the important role of the convolutional layer in transforming the feature space into one with better separability between classes.
Table 6 and Table 7 summarize the minimum and maximum of the computed Euclidean distances for the phonemes and words, together with the minimum and maximum of the standard deviation of the features, to emphasize that, even if the class means move away from each other, the standard deviation did not undergo significant changes. The standard deviation of each feature for each class was computed as:
$\sigma_{u,j} = \sqrt{\dfrac{1}{N_u} \sum_{i=1}^{N_u} \left(\hat{x}^{(u)}_{i,j} - \mu_{u,j}\right)^2}$   (15)

where $\sigma_{u,j}$ is the standard deviation of feature j for utterance u.
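A corresponding sketch of Equation (15), again under the reconstruction above, is the following.

```python
# Sketch of Equation (15): standard deviation of every normalized feature per class.
import numpy as np

def per_class_feature_std(X_norm, y):
    """X_norm: (n_vectors, n_features) normalized features; y: utterance labels."""
    classes = np.unique(y)
    # One row per class, one column per feature
    return classes, np.stack([X_norm[y == c].std(axis=0) for c in classes])
```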
4.2. Brain Areas Analysis
The second study conducted in this paper evaluated the performance of the system when using a smaller number of electrodes from different regions of the brain. Table 8 details the regions selected for the study, along with the electrodes used for each region.
The regions were first selected based on the major brain areas corresponding to the EEG electrodes, i.e., the frontal, central, and occipital areas, and afterwards based on the anatomical brain areas involved in speech conceptualization and articulatory planning, and in the initiation and coordination of the motor stimulus sent to the effectors. Regarding the anatomical brain areas involved in speech, it is well known that the Broca area has an important role in speech production [23], the primary motor cortex generates the signals that control the execution of the movements and has also been found to relate to motor imagination [24], and the secondary motor area is responsible for motor planning [25]. The next step was to identify the spatial position of each anatomical structure involved in speech with respect to the electrodes positioned in the 10-20 system. The Broca area is positioned in the posterior half of the inferior frontal gyrus; the primary motor cortex corresponds to the precentral gyrus; and the secondary motor cortex is mostly identified with Brodmann area 6, which covers the precentral gyrus, the caudal superior frontal gyrus, and the caudal middle frontal gyrus. These anatomical structures were mapped to electrodes of the 10-20 positioning system by L. Koessler et al. in [26], resulting in the channels selected for the anatomical speech brain areas detailed in Table 8.
For a better perspective of the analyzed brain areas, Figure 6 presents the 10-20 electrode positioning system used to acquire the signals of the Kara One database, with the selected regions colored differently.
The obtained results for each brain area and for the combinations of brain areas studied in this paper are presented in Table 9.
As expected, after selecting fewer channels corresponding to the different brain areas of interest, the system classification accuracy decreased. The best accuracy was achieved when using the electrodes from the anatomical speech brain areas identified according to the specialized literature, from both the left and right hemispheres, reaching a value of approx. 40%. Although there was a decrease in performance compared to using all the available channels, there were advantages regarding the computation, memory, costs, and even the comfort of using a smaller number of electrodes.
4.3. Complexity and Memory Analysis
The tendency of an ISR system is to perform with the highest possible accuracy on a portable device with limited resources. There will always be a tension between the performance and the complexity when developing an ISR system. In this section of the paper, we focused on a study regarding the complexity and memory of the developed system.
When it comes to the complexity and memory of an intelligent system, the primary resource consumer is generally the neural network. In the case of a CNN neural network with long-term memory, the complexity is given by the long-term memory convolutional layer, due to the need to compute the outputs of the implied gates, i.e., the input gate, the forget gate, and the output gate, while also considering the 2D spatial correlation in this process. For the 3D convolution corresponding to the input tensor (N_lines × N_columns × N_channels) and the kernel (k_lines × k_columns × k_channels), the complexity can be measured as:

$O\left(N_{lines} \cdot N_{columns} \cdot N_{channels} \cdot \log\left(k_{lines} \cdot k_{columns} \cdot k_{channels}\right)\right)$   (16)

when using the fast Fourier transform, as described in [27].
In our case, the input has equal numbers of lines and columns, N_lines = N_columns = N, the kernel is square, k_lines = k_columns = k, and the number of input channels is N_channels = C, so Equation (16) becomes:

$O\left(N^2 \cdot C \cdot \log\left(k^2 \cdot C\right)\right)$   (17)

where N is the number of input lines/columns, k is the kernel dimension, and C is the number of input channels.
However, for a ConvLSTM2D layer, this computation (described in Equation (17)) is performed for all the filters of the layer (N_filters), for all windows (N_window), and for all gates, plus the computation of the last cell output (a total of four computations). Finally, the complexity can be approximated as:

$O\left(4 \cdot N_{filters} \cdot N_{window} \cdot N^2 \cdot C \cdot \log\left(k^2 \cdot C\right)\right)$   (18)
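Under the reconstruction of Equations (16)–(18) given above, the operation count of a ConvLSTM2D layer can be estimated as in the sketch below; the concrete values of N, k, C, the number of filters, and the number of windows are illustrative assumptions, not the layer sizes of the actual network.

```python
# Sketch of the operation-count estimate of Equations (16)-(18), under the
# reconstruction above; all numeric values are illustrative assumptions.
import math

def convlstm2d_ops(N, k, C, n_filters, n_windows):
    conv_ops = N * N * C * math.log2(k * k * C)      # Equation (17), base-2 log assumed
    return 4 * n_filters * n_windows * conv_ops      # Equation (18): gates, filters, windows

# Illustrative values: 62x62 input, 3x3 kernel, 16 input channels, 32 filters, 10 windows
print(f"{convlstm2d_ops(N=62, k=3, C=16, n_filters=32, n_windows=10):.2e}")
```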
The details regarding the complexity of all system stages and of all specific layers of the neural network used are presented in Table 10, along with the memory consumption and the execution time measured on an AMD Ryzen 7 4800HS CPU.
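An execution time of the kind reported in Table 10 can be measured, for instance, as in the sketch below, which reuses the hypothetical build_cnnlstm model and input shape from the earlier sketch; the number of runs and the use of model.predict are assumptions.

```python
# Sketch of measuring the mean per-decision inference time of the CNNLSTM model;
# build_cnnlstm is the hypothetical constructor from the earlier sketch.
import time
import numpy as np

model = build_cnnlstm()
sample = np.random.randn(1, 10, 62, 62, 1).astype("float32")   # one input example (assumed shape)

def mean_inference_time(model, sample, n_runs=50):
    model.predict(sample, verbose=0)                 # warm-up call (graph tracing)
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(sample, verbose=0)
    return (time.perf_counter() - start) / n_runs

print(f"mean decision time: {mean_inference_time(model, sample) * 1e3:.1f} ms")
```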
5. Discussion
This paper aimed to develop an intelligent system for the imaginary speech recognition of seven phonemes and four words from the Kara One database. To achieve our goal, we passed the signals from the database through a preprocessing stage, in which an expert analyzed the imaginary speech epochs and eliminated those with high noise. Afterwards, in the feature extraction stage, we computed the cross-covariance between the channels in the frequency domain to preserve the channel connections in a compact matrix. In this stage, we showed, by using the LDA algorithm for feature visualization in a 2D feature space, that this method decodes the features better than using the signals in time or the cross-covariance applied over the channels in time. We used a CNNLSTM neural network as the classifier and concluded that it performed better than a CNN (in comparison to paper [17]), due to the consideration of not only the spatial correlations, but of the time variations, too. We were looking to develop a system that can easily be transferred to a low-cost portable device, so we also studied the possibility of using a smaller number of electrodes grouped into different brain regions. Finally, we also studied the complexity of the algorithm and the time required for the system to decide the membership of an input.
5.1. LDA for Feature Extraction Analysis
In the feature extraction stage, the signals were passed through a feature reduction algorithm used for the visual analysis of three different feature types: the signal in time without processing, the cross-covariance in time, and the cross-covariance in the frequency domain. We observed that, when using the cross-covariance in the frequency domain, the features were clustered into the desired classes; although the clusters overlapped, a good distinction between the phonemes and the words was obtained. We also observed that the distances between the class means became larger with this type of feature extraction. When using the cross-covariance in the time domain, the phonemes and the words were distributed in clusters, but these almost completely overlapped, which made the classification harder. The distances between the computed class means were larger compared to those of the raw signal features, but the separability between the phonemes and words was lost. We observed that the means of the raw signals were close to each other for phonemes and for words, but a clear distinction between the two groups was still made.
5.2. CNNLSTM vs. CNN
The main advantage of an LSTM neural network is its capability of memorizing long-term dependencies. This ability comes in handy when non-stationary, time-variant signals, such as EEG, are analyzed. In addition to the LSTM long-term memory, the spatial dependencies of the EEG signals were taken into consideration by adding CNN to the LSTM layer. This combination helped the neural network learn both spatial and time-variant features, increasing the accuracy of the system from 0.37, obtained with a CNN architecture with similar parameters, up to 0.43.
The mean confusion matrix of the k folds presented in Figure 5 reveals that there is hardly any confusion between the phonemes and words, as expected after the visualization of the LDA dimension reduction results in Figure 2. Confusions were made between the phonemes /tiy/ and /diy/ at a relatively higher rate compared to the rest of the phonemes. This behavior can be explained by the similar pronunciation mechanisms of the sounds “t” and “d”. For both utterances, the vocal tract maintains almost the same position, and the sound is produced by the presence or absence of vocal cord vibration. Major confusions of approx. 19% were also made between the words “pat” and “pot”, for the same reason of a similar pronunciation mechanism.
In Table 4, we computed the Euclidean distance between the means of each phoneme and word analyzed for the cross-covariance in the frequency domain. We observed that these distances are smaller than those computed for the features obtained after the first CNNLSTM layer (Table 5), meaning that, after the signal was processed by the first neural network layer, the feature space of the utterances was modified and the features became more separable.
5.3. Brain Area Analysis
The final goal of an automatic ISR system is to obtain the best possible accuracy using a portable device with limited resources. Considering this, we further studied the system behavior when using a smaller number of electrodes, located in specific areas of the brain, for classification. This approach enhances the portability of the device and decreases the resources needed for development, but at the cost of accuracy, as can be seen in Table 9.
When the number of channels was reduced, the accuracy dropped as well. However, when using the electrodes positioned over the anatomical regions of the brain responsible for speech intention and production, the accuracy of the system reached 0.40, a drop of only 3% in comparison to the all-channel accuracy. This means that 93% of the speech information is concentrated in these channels, and only 7% of the information is distributed over the parietal and occipital regions. The main advantage of using only the channels corresponding to the anatomical structures of the brain involved in speech is that the number of electrodes is reduced, in this case by more than half, from 62 to 29, which is a big computational and cost gain.
Another important finding of the study of the brain areas involved in imaginary speech recognition is that acquiring the data using a visual stimulus appearing on a prompt does not affect the study, because the occipital area is less involved in the decision-making of the system. This is due to the two-second period introduced in the protocol design between the prompt display and the actual imagination of the phoneme.
5.4. Complexity and Memory Analysis
An important aspect when developing an ISR system is the complexity of the algorithm and the memory used. Usually, the main consumer of resources is the neural network, as can also be seen in Table 10. The maximum number of operations is given by the second layer of the CNNLSTM neural network and is on the order of approx. O(6.3 × 10^9). However, the execution time for a decision is under 100 ms, even when using all the channels in the computation, which means that the system can still be implemented in a real-time device.
When reducing the number of electrodes, we can also see in Table 11 an important decrease in the execution time. An alternative to an all-electrode system can be the ASBA electrode system, for which the execution time dropped considerably, from 80 ms to 20 ms, with a drop in accuracy of only 3%. This alternative is better from other points of view, too, such as increasing the user comfort when using the device and decreasing the cost of the final product.
When it comes to memory usage, the system has its limitations, because it needs at least 2 GB of memory just to store the weights, due to the gates and the long-term memory of the LSTM network. However, the LSTM addition improved the overall classification and added great value to the final system.
6. Conclusions
This paper aimed to develop a subject’s shared system for recognizing seven phonemes and four words collected during imaginary speech from the Kara One database. To achieve the proposed goal, the database was preprocessed and features were computed in order to decode the hidden information regarding imaginary speech. The features were based on computing the cross-covariance in the frequency domain and were introduced into a CNNLSTM neural network for the final classification into the desired classes.
During our research, we observed that, by computing the cross-covariance in the frequency domain, the feature space turned into a space where the utterances could be separated more easily. This conclusion was drawn after analyzing the feature space in two dimensions by applying the LDA algorithm for feature reduction.
Another feature comparison was made between the extracted features and the features obtained after the first CNNLSTM layer. Table 4 and Table 5 present the Euclidean distances between the means of the classes, and it can be easily seen that these distances increased after the CNNLSTM layer. This suggests that the feature space of the input vectors changed so that the classes became more separable. This is one of the advantages of deep neural networks, especially CNN networks, which use the first layers for feature extraction before the actual classification of the data.
This paper also showed an improvement in system performance when using a CNNLSTM neural network with respect to a CNN. The accuracy increased from 37% to 43% when using the CNNLSTM with the same processing chain of the database. The advantage of the CNNLSTM is the consideration of both the spatial and the temporal connections: the CNN uses convolution for the spatial connection between the channels, while the LSTM brings the long-term memory that is vital for non-stationary, time-variant signals such as EEG.
The proposed system considers the portability of a possible real-time portable device. Therefore, we also studied the system’s behavior when reducing the number of channels used for classification. We separated the electrodes into the main areas, i.e., frontal, central, and occipital, studied the areas involved in imaginary speech production, and selected the channels accordingly for the anatomical speech brain areas study. We concluded that 93% of the information is concentrated in the anatomical speech brain areas, obtaining an accuracy of 40% with the 29 channels used, in comparison to the 62 used in the beginning. Even if the accuracy dropped by 3% when using only 29 electrodes, the system brings more advantages in terms of execution time, portability, comfort, and cost.