1. Introduction
The utilization of assistive technologies, particularly electric wheelchairs, has experienced a significant surge in recent decades, playing a pivotal role in supporting individuals facing mobility challenges and enhancing their overall quality of life. The World Health Organization (WHO) reports that over 1.3 billion people globally grapple with structural or functional impairments, with at least 80 million necessitating the use of wheelchairs [
1]. While conventional control methods like keyboards, mice, joysticks, or touchscreens are effective for able-bodied users, they often prove impractical for people with severe disabilities caused by conditions such as spinal cord injuries, paralysis, muscular dystrophy, multiple sclerosis, or stroke. Traditional control methods, such as joystick-based interfaces, exhibit usability limitations, particularly for those with severe motor disabilities [
2]. Consequently, the development of an efficient interface enabling users with physical disabilities to convey their intentions or commands to assistive devices becomes imperative.
In response, researchers have turned to innovative technologies, notably brain–computer interfaces (BCIs), to design alternative, more intuitive control strategies, as demonstrated by Rebsamen, B. et al. [
3]. BCIs establish a direct communication link between the human brain and external devices, eliminating the reliance on traditional motor control mechanisms [
4]. This technology holds immense potential for transforming the lives of individuals with limited mobility, granting them the ability to control assistive devices, such as wheelchairs, through their brain activity as discussed in [
5].
Various techniques have been employed for wheelchair control, with low-level motion control systems converting EEG data into direct motion commands for wheelchair operation as demonstrated in [
5]. Tanaka et al. [6] pioneered this field by introducing an electric wheelchair controlled directly by the user’s brain activity. By imagining left or right limb movements during Motor Imagery (MI) tasks, users dictate the direction of the next movement.
Several approaches have been explored in the realm of BCI-based wheelchair control. MI tasks, wherein users mentally simulate specific movements, have been leveraged to detect intention and initiate corresponding wheelchair motions, as reported by Pires, C. P. et al. [
7]. Additionally, Steady-State Visually Evoked Potentials (SSVEP) and P300-based BCIs have been employed to achieve higher accuracy in control through visual stimuli and attention-based paradigms, as proposed by Ortner, R. et al. [
8].
Despite the promising results demonstrated by these BCI-based control methods, there remain challenges that require attention for practical implementation. Issues such as inter-subject variability in brain rhythms, the need for prolonged visual stimulation, and the computational complexity of hybrid control approaches pose limitations to achieving seamless and efficient wheelchair control [
9].
In this paper, we present a human–machine interaction method for direct control of a robotic wheelchair based on hybrid control using facial expressions captured by a webcam and EEG signals. The main outcome of our research is the ability to recognize the user’s control expression based on their gaze direction. This enables a more natural interaction style that could help with continuous control of the wheelchair.
Specifically, our method targets wheelchair users who are severely disabled but can still perform basic eye and head movements to operate the wheelchair. To this end, we collected EEG data with an Emotiv Insight headset and used machine learning algorithms (e.g., CNN-LSTM) to recognize the signal patterns triggered by various facial expressions, such as Smile (Backward), Eye Blink (Stop), Wink Left (Rotation Left), and Wink Right (Rotation Right). We also used a forward-facing camera to track the user’s head movements with computer vision methods.
This paper is structured as follows:
Section 2 provides a review of related work, while
Section 3 details the proposed system architecture.
Section 4 delves into our methodology, offering insights into data acquisition, analysis, and classification algorithms. Tests and results are presented and discussed in
Section 5. Finally,
Section 6 is dedicated to the findings and conclusions of the article.
2. Related Work
On the one hand, patients’ data are private; on the other hand, remote applications such as e-health services are becoming increasingly common, so improving user security is a primary concern for researchers. Furthermore, unauthorized users and even hackers pose a major risk to security and privacy. As a result, it is extremely difficult to provide effective e-health services while maintaining patient data availability, privacy, and validity. The first prerequisite is undoubtedly access control, which facial recognition can effectively ensure. This section examines related work from two perspectives: embedded solutions for secured applications and access control mechanisms.
In [
10,
11], the technique employed for electrophysiological measurements was low-resolution brain electromagnetic tomography (LORETA). LORETA solves the EEG inverse problem to calculate the three-dimensional distribution of neuronal electrical activity. This linear estimation technique localizes signal sources in the human brain without requiring any data beyond the EEG signal. In [
10], the authors addressed the removal of biological artifacts, such as the subjects’ anxious tics and short squinting of the eyelids. Furthermore, the authors of [
12] studied the most common type of Electrooculographic artifact, namely eye blinking. In [
13], the authors combined EEG signals with the power spectrum of eye-blink artifacts to develop a brain–computer interface. They further conducted a spectral analysis of EEG data from patients suffering from insomnia.
The authors of [
14] focused on artifact reduction in hybrid BCIs. The researchers developed an approach that combines stationary wavelet transforms with adaptive thresholding to effectively remove artifacts from EEG signals. The study in [
15] employed the IC MARC classifier to investigate the impact of various artifacts on a motor imagery-based Brain–Computer Interface (BCI) system. The findings demonstrated that when utilizing all 119 EEG channels, muscle artifacts had a detrimental effect on BCI performance. This was observed by comparing the results to a configuration with 48 centrally placed EEG channels. In [
16], the authors introduced a new method for automatically eliminating eye-related EEG artifacts using independent component analyses and outlier identification techniques. The OD-ICA method demonstrated effectiveness in removing Ocular Artefacts (OA) while preserving significant EEG signals. Peak detection in online EEGs for BCIs, explored in [
17], addressed the impact of filtering on BCI performance, emphasizing peak frequency detection. While peak detection improved with the filter, the BCI performance suffered from movement and increased artifact removal. The issue of noisy EEG data was resolved in [
18] by employing the Lomb–Scargle periodogram for spectral power estimation and a denoising autoencoder (DAE) for training, successfully decoding incomplete EEG recordings.
Neural networks and expert classification systems, including AI-based solutions, also play an important role. In [
19], a convolutional neural network (CNN) was employed to perform skin cancer classification. Furthermore, [
20] utilized a deep neural network (DNN) to classify histopathologic images of breast cancer. In contrast, ref. [
21] employed a hybrid convolutional and recurrent deep neural network for classification purposes. Automation system control through artificial intelligence was addressed in [
22], presenting a method for controlling a robot using an algorithm based on an artificial neural network.
Several recent studies build on deep learning techniques, such as convolutional neural networks (CNNs), which have proven effective. The authors of [
23,
24] suggested a research strategy for producing 3D channel spectrograms that combines three different time–frequency representations (spectrograms, gammatone spectrograms, and the continuous wavelet transform). This strategy has been applied successfully to the automatic identification of phoneme classes for phone-attribute extraction and to the diagnosis of speech impairments in cochlear implant users.
In the context of wheelchair control, Ba-Viet et al. (2020) explored wheelchair navigation using EEG signals and 2D maps with a camera [
25]. Other researchers investigated an intelligent wheelchair using eye detection and visual systems (Dorian 2021 [
26], Agnes 2022 [
27]). These experiments revealed a notable disparity in perceived convenience, referred to as preferred control, caused by the sensitivity of camera images to lighting conditions and of the EEG signal to an individual’s state. Our study aims to enhance control system performance by combining facial expressions and neural signals.
4. Methodology
In the “webcam-based command” section (Section 4.1), we present the fundamental structure of the 3D-CNN used for image classification. In the “EEG headset-based command” section (Section 4.2), we describe the essential components of the 1D-CNN-LSTM approach for processing EEG data. Lastly, in Section 4.3, we introduce a feature fusion network that combines the two aforementioned methods and explain how the network parameters were selected.
4.1. Webcam-Based Command
4.1.1. Detection of Face, Eye, and Mouth
The image acquisition, processing, and classification system, illustrated in Figure 4, consists of three primary steps. The first step employs the OpenCV library for face detection. In the second step, learning takes place, followed by the third step, image classification. Deep learning algorithms, specifically convolutional neural networks, are used for both learning and classification, enabling matching against the database. Finally, the classification results are displayed to the user.
4.1.2. Closing and Opening Ratio for Expression Detection
This method is used to distinguish between blinking and winking left or right, whether natural or deliberate. Blinking is the rapid closing and reopening of the human eye. Each individual has a slightly different blinking pattern. The pattern differs in the speed of closing and opening, the degree of pressure on the eye, and the duration of blinking. An eye blink lasts approximately 100 to 400 ms [
29].
We used four expressions, Wink Right, Wink Left, Blink, and Smile, together with measurements of the Eye Aspect Ratio (EAR). There are numerous algorithms for face recognition, but in this paper we focus on Dlib’s method, which relies on the Histogram of Oriented Gradients (HOG) for face detection.
To do this, we must first localize the human face in the overall image. Face detection is a technique that identifies a human face in an image and returns the value of the bounding box or rectangle associated with the face in x, y, w, and h coordinates [
29].
We must first determine the position of the face in the image before locating its finer features, such as the lips and eyebrows. By using points within this rectangle, the face recognition software can identify all the necessary features of a human face.
The 68-point model of Dlib is shown in
Figure 5, where points from 1 to 68 are visible. We discuss how to recognize these expressions (Blink, Wink Left, Wink Right, and Smile) and how to compute the EAR.
We used state-of-the-art facial feature recognition systems to localize the contours of the eyes and eyelids. From the 68 landmark points, we derived the EAR, which was used as a parameter to estimate the opening state of the eyes, as defined in Equation (1):
EAR = (‖p2 − p6‖ + ‖p3 − p5‖) / (2 · ‖p1 − p4‖)  (1)
where p1, p2, p3, p4, p5, and p6, shown in
Figure 5b, correspond to the points on the contour of the left eye, namely points 37, 38, 39, 40, 41, and 42 in
Figure 5a. Using this metric, a classifier was used to recognize eye blinks and left and right winks.
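For illustration, the following minimal sketch shows how the EAR of Equation (1) can be computed from Dlib’s 68-point landmarks; the detector setup, the landmark-model file name, and the 0-based index ranges are assumptions for this example rather than our exact implementation.

```python
# Hedged sketch: computing the Eye Aspect Ratio (EAR) from Dlib landmarks.
# The model file name and index conventions are illustrative assumptions.
import cv2
import dlib
from scipy.spatial import distance as dist

detector = dlib.get_frontal_face_detector()  # HOG-based face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_aspect_ratio(pts):
    """pts: six (x, y) landmarks around one eye, ordered p1..p6."""
    a = dist.euclidean(pts[1], pts[5])   # ||p2 - p6||
    b = dist.euclidean(pts[2], pts[4])   # ||p3 - p5||
    c = dist.euclidean(pts[0], pts[3])   # ||p1 - p4||
    return (a + b) / (2.0 * c)

frame = cv2.imread("frame.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
for face in detector(gray):
    shape = predictor(gray, face)
    # points 37-42 and 43-48 of Figure 5, expressed with 0-based indices
    left = [(shape.part(i).x, shape.part(i).y) for i in range(36, 42)]
    right = [(shape.part(i).x, shape.part(i).y) for i in range(42, 48)]
    ear_left, ear_right = eye_aspect_ratio(left), eye_aspect_ratio(right)
```

Comparing ear_left and ear_right against a threshold (Section 5.2.1 uses 0.15) then distinguishes a blink, where both eyes close, from a left or right wink.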
When smiling, the distance between the corners of the mouth increases. However, since mouth sizes differ between individuals, we normalized this metric by dividing it by the jaw width to obtain a general ratio that can be applied to different people.
In our base detector, we use the x-coordinates of points 49, 55, 3, and 15 to calculate the EAR_Smile defined in Equation (2), where p49, p51, p53, p55, p57, and p59 are the points on the contour of the mouth in
Figure 6.
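As an illustration only, the sketch below shows one plausible reading of this smile ratio, namely the mouth-corner distance (points 49 and 55) normalized by the jaw width (points 3 and 15); the exact form of Equation (2) is not reproduced here, so this formulation is an assumption.

```python
# Hedged sketch: a smile ratio from Dlib landmarks (0-indexed), assuming
# mouth-corner distance normalized by jaw width as described in the text.
def smile_ratio(shape):
    mouth_width = abs(shape.part(48).x - shape.part(54).x)  # points 49 and 55
    jaw_width = abs(shape.part(2).x - shape.part(14).x)     # points 3 and 15
    return mouth_width / jaw_width
```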
4.1.3. Algorithm of Classification
The task of expression estimation is to determine whether the user is blinking, winking left, winking right, or smiling. For faces that are not fully frontal, the proposed CNN extracts more resilient features and effectively classifies the four expressions. To improve the performance of these methods, we tracked a sequence of images instead of inputting a single image [
31,
32].
The neural network topology (3D-CNN) is depicted in
Figure 6. CNNs have extensive applications in various domains, such as natural language processing, recommendation systems, and image and video recognition. In our case, these networks were used to process image sequences of smiling, winking left or right, and blinking.
Table 2 displays the architecture of our neural network. The convolutional layer is responsible for filtering the input signals, and the max-pooling layer connected to it reduces the data volume, and hence the training time, while preserving quality. This convolution–pooling process was performed three times, in Layers 1, 2, and 3, as indicated in Figure 6. Connecting to a fully connected layer then establishes the relationship between a class and the positions of features in an image. Essentially, the dense layer reshapes the tensor, the fundamental data structure underlying machine and deep learning methods.
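The following sketch illustrates a 3D-CNN of this shape (three convolution–pooling stages followed by dense layers) in Keras; the input dimensions, filter counts, and kernel sizes are placeholders and do not reproduce the exact values of Table 2.

```python
# Hedged sketch of a 3D-CNN over short image sequences: conv + max-pooling x3,
# then dense layers. All sizes are illustrative placeholders.
from tensorflow.keras import layers, models

def build_3d_cnn(frames=16, height=64, width=64, channels=1, n_classes=4):
    model = models.Sequential([
        layers.Input(shape=(frames, height, width, channels)),
        layers.Conv3D(16, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((1, 2, 2)),
        layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D((2, 2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),  # Blink, WinkL, WinkR, Smile
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```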
4.2. EEG Headset-Based Command
4.2.1. Data Acquisition
To obtain EEG data with the EMOTIV Insight headset, subjects performed each of 4 facial expressions (Smile, Blink, Wink Left, and Wink Right) for 5 min, so that 20 min of EEG data were available for each subject. A twenty-minute data collection protocol was created, with each expression first recorded in a separate file. Subsequently, all expressions of each person were consolidated into one file to avoid interference between different expressions and persons. This consolidation also ensured a balanced distribution of each expression and facilitated a more accurate analysis without compromising the integrity of the dataset.
The data collected in the different situations mentioned above need to be preprocessed for later use in the machine learning component. To better understand the preprocessing, the data source as well as the format and features are discussed below.
The 5-channel EEG device provided 128 samples per channel per second. After transforming the collected data into the frequency domain, each sample is represented by the weighted and arithmetic means for each of the 5 device channels and the 4 wave classifications (frequency bands).
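As a rough illustration of this feature representation, the sketch below converts one second of 5-channel EEG (128 samples per channel) into per-channel band-power means; the band limits and the use of a plain FFT are assumptions, not the headset’s exact processing.

```python
# Hedged sketch: one second of 5-channel EEG -> frequency-domain band means.
# Band edges are illustrative, not the paper's exact "wave classifications".
import numpy as np

FS = 128                                        # samples per second, per channel
BANDS = {"theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 45)}   # assumed band edges (Hz)

def band_means(window):                         # window: array of shape (5, 128)
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / FS)
    power = np.abs(np.fft.rfft(window, axis=1)) ** 2
    features = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        features.append(power[:, mask].mean(axis=1))   # mean per channel
    return np.stack(features, axis=1)           # shape (5 channels, 4 bands)
```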
4.2.2. Preprocessing
Data cleaning is the process of modifying, correcting, and organizing data inside a dataset to make it consistent and ready for analysis. To ensure the best possible analysis, this entails purging any corrupted or unnecessary data and formatting the remainder in a computer-readable manner [
33].
It is therefore important to carry out proper data cleaning to ensure that the best possible results are obtained.
Data cleaning is composed of six steps:
Remove irrelevant data;
Remove duplicate data;
Fix structural errors;
Handle missing data;
Filter outliers;
Validate the data.
Data augmentation is the process of making altered copies of a dataset and is a means of artificially expanding the training set. It entails either creating new data points via deep learning or making small adjustments to the existing dataset. We applied this augmentation method to the EEG data. To expand the breadth and diversity of the training set, small changes were made to the original data before generating these new data, including removing duplicate samples and handling missing samples by averaging the values immediately before and after them [
33].
The new data were thus derived from the original data, with minor modifications that increase the size and variety of the training set; these are examples of straightforward data augmentation techniques, such as random swapping, insertion, and synonym replacement.
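A minimal sketch of these cleaning steps on a tabular EEG export is given below; the use of pandas, linear interpolation for missing samples, and a three-sigma clip for outliers are illustrative choices rather than the exact procedure.

```python
# Hedged sketch of the cleaning steps listed above: drop duplicates, replace
# missing samples with the mean of their neighbours, and clip outliers.
# Column layout is illustrative.
import pandas as pd

def clean_eeg(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                          # remove duplicate rows
    # a missing value becomes the average of the samples before and after it
    df = df.interpolate(method="linear", limit_direction="both")
    # filter outliers: clip each channel to mean +/- 3 standard deviations
    for col in df.select_dtypes("number").columns:
        mu, sigma = df[col].mean(), df[col].std()
        df[col] = df[col].clip(mu - 3 * sigma, mu + 3 * sigma)
    return df
```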
4.2.3. Algorithm of Classification
Figure 7 represents the architecture of our system. In this work, we processed EEG signals for the estimation of facial expressions using a CNN and LSTM combination [
34]. A key advantage of CNNs is their ability to extract robust features and overcome the drawbacks of conventional techniques. The network comprises three types of layers: convolutional, max-pooling, and LSTM. The convolutional layer filters the input signals. The max-pooling layer reduces the size of the data while preserving their features, which helps to shorten the training period. Using the EEG database, the LSTM layer trains the model and performs classification [
34].
Table 3 provides the specifics of our neural network’s construction and displays the 1D-CNN-LSTM architecture for the EEG modality. The input sequence consists of one-second EEG recordings. Two one-dimensional convolutional layers, Conv Layers 1 and 2, were applied to this sequence, as indicated in Figure 7. Each was followed by ReLU activation and max-pooling layers, allowing temporal features to be extracted directly from the time series data. The extracted features were then flattened and passed to the LSTM layer, which models the ordering of the temporal features in order to classify the time series. Finally, connecting to a fully connected layer establishes the relationship between the locations of features in the EEG data and a class; in essence, the dense layer reshapes the tensor.
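The sketch below shows a 1D-CNN-LSTM of this form in Keras; the filter counts, kernel sizes, and LSTM width are placeholders rather than the values of Table 3.

```python
# Hedged sketch of the 1D-CNN-LSTM pipeline: two Conv1D + ReLU + max-pooling
# stages, then an LSTM and a dense classifier. All sizes are placeholders.
from tensorflow.keras import layers, models

def build_cnn_lstm(timesteps=128, channels=5, n_classes=4):
    model = models.Sequential([
        layers.Input(shape=(timesteps, channels)),      # one second of EEG
        layers.Conv1D(32, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),                                 # temporal ordering of features
        layers.Dense(n_classes, activation="softmax"),   # WinkL, WinkR, Blink, Smile
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```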
The combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) offers several advantages for signal processing. CNNs excel at extracting local features from inputs by means of convolution operations and pooling layers, and they are particularly adept at capturing patterns in the spatial dimensions. LSTMs, on the other hand, possess exceptional capabilities for processing sequential data, such as time series or EEG signals, owing to their ability to handle long-term dependencies through their gating mechanisms.
4.3. Fusion of EEG Signals Decision and Image Processing Decision
Implementing layer fusion for decision-making involves combining the outputs of multiple layers or models to arrive at a final decision. This fusion process can be achieved using various techniques, such as voting, averaging, or weighted averaging. Here is an example of implementing layer fusion for decision-making.
In this work, we first obtained the predictions from each model for the given input data. Next, we performed fusion to combine these predictions and made a final decision based on the fusion result.
Our goal was to increase the recognition accuracy of these emotions (Smile, WinkL, WinkR, and Blink) by merging webcam images with EEG data. Webcams function as a visual supplement to the EEG data, capturing the dynamic interaction of facial emotions and eyebrow movements. The integration of the image processing classifier and the EEG classifier is depicted in
Figure 8. Before performing fusion between two of these classifications, we worked on two classification algorithms: 1D-CNN-LSTM for processing EEG signals from the EMOTIV Insight and 3D-CNN for analyzing pictures.
We used voting-based fusion to count the occurrences of each prediction and select the one with the highest count as the fused prediction. Based on the fused prediction, we made a decision by mapping it to the corresponding class label.
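A minimal sketch of this voting-based decision fusion is shown below, assuming integer class indices for the four expressions; gathering predictions over a short window and the tie-handling are illustrative choices.

```python
# Hedged sketch: majority voting over the predictions gathered from the
# webcam and EEG classifiers (e.g., over a short window of frames/epochs).
from collections import Counter

LABELS = {0: "Wink Left", 1: "Wink Right", 2: "Blink", 3: "Smile"}

def fuse_predictions(predictions):
    """predictions: list of class indices from both classifiers."""
    fused_class, _ = Counter(predictions).most_common(1)[0]
    return LABELS[fused_class]

# Example: three webcam frames and two EEG epochs voted within one window
print(fuse_predictions([2, 2, 3, 2, 2]))   # -> "Blink"
```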
In a deep learning algorithm, the concatenation layer is used to combine the outputs of multiple layers or branches of the network. This can improve the network’s ability to learn complex relationships by allowing it to access features learned at different levels of abstraction. Regularization techniques, such as dropout and webcam/EEG signal regularization, are used to prevent the network from overfitting the training data. Dropout randomly removes some units from the network during training, forcing the network to learn redundant representations, which can improve its ability to generalize to new data. Overall, concatenation and regularization are important tools for designing deep learning algorithms that learn complex relationships and generalize well to new data.
We employed two layers; the concatenation layer typically refers to a neural network layer that combines or fuses information from multiple sources or modalities. This is common in multimodal deep learning when working with data from different sources such as text, images, and audio and wanting to combine them for a specific task.
We first defined two sets of features as input layers (input_features1 for data from the webcam and input_features2 for data from the EEG signals). We then combined the two sets with a concatenation layer and applied regularization to the fused representation, effectively merging the information from both sources.
Additionally, it was important to adjust the number of features, activation functions, and other hyper-parameters based on our use case (4 features in the first set (webcam) and 4 features in the second set (EEG signals)). Synchronization was achieved by operating the headset and the webcam at the same frequency.
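For illustration, the sketch below builds such a feature-level fusion head in Keras, with two 4-feature inputs named as above, a concatenation layer, dropout, and L2 regularization; the layer sizes and regularization strengths are assumptions.

```python
# Hedged sketch of the feature-level fusion: two 4-feature branches joined by
# a Concatenate layer, with dropout and L2 regularization. Sizes illustrative.
from tensorflow.keras import layers, models, regularizers

input_features1 = layers.Input(shape=(4,), name="webcam_features")
input_features2 = layers.Input(shape=(4,), name="eeg_features")

merged = layers.Concatenate()([input_features1, input_features2])
x = layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(merged)
x = layers.Dropout(0.3)(x)                                 # random unit exclusion
output = layers.Dense(4, activation="softmax")(x)          # four expressions

fusion_model = models.Model([input_features1, input_features2], output)
fusion_model.compile(optimizer="adam", loss="categorical_crossentropy",
                     metrics=["accuracy"])
```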
5. Experimental Results and Discussion
5.1. Evaluation Metrics
At the experimental level, we carried out the data preprocessing and signal visualization with Python V 3.8 (Jupyter Notebook, Anaconda) and used a PC with 16 GB of RAM, an Intel (Santa Clara, CA, USA) CPU, and a GeForce GTX 1080 GPU.
Preparing the data for classification is the required first step. The EEG and image dataset was split into two portions: the first, 80%, was designated as the training data and the second, 20%, as the test data; each portion was further subdivided into inputs and outputs. The outputs correspond to the emotional expressions in the videos and the EEG readings and can take a value of either 0 or 1.
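A minimal sketch of this 80/20 split with scikit-learn is given below, assuming X holds the prepared input features and y the integer class labels.

```python
# Hedged sketch: 80% training / 20% test split of the prepared dataset.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42,
    stratify=y)   # keeps class proportions balanced (illustrative choice)
```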
We distinguished four validation measures that can be used to assess a classification algorithm.
A model’s precision is its capacity to identify only relevant items. Equation (3) expresses the percentage of correct positive predictions:
Precision = TP / (TP + FP)  (3)
Recall corresponds to the capacity of a model to locate all relevant cases. Equation (4) expresses the percentage of true positives found among all ground-truth positives:
Recall = TP / (TP + FN)  (4)
The accuracy is calculated according to Equation (5):
Accuracy = (TP + TN) / (TP + TN + FP + FN)  (5)
The F1-score is calculated by Equation (6):
F1-score = 2 × (Precision × Recall) / (Precision + Recall)  (6)
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
5.2. Evaluation Results
To evaluate the performance of our proposal, we proceeded in three steps: first, we evaluated the EEG-only command; next, the image-processing-only command; and finally, the fusion of the two sources. We then present the fusion results and, to demonstrate the efficiency of our technique, compare them with existing techniques.
5.2.1. Webcam Only Command Results
The first step consists of identifying the thresholds to be used. We repeated the same tests for a single user and calculated the ratio values using Equations (1) and (2). For eye state detection, the threshold was 0.15, whereas it was 0.35 for mouth state detection.
As shown in
Figure 9a, if the EAR of the mouth is 0.37, which exceeds the threshold of 0.35, the system detects the expression of a smile. Conversely, in the scenario depicted in Figure 9b, the eye ratio is 0.19 relative to the threshold of 0.15, leading to the identification of the blinking expression.
5.2.2. EEG Signals Only Command Results
The results of facial expressions inferred from the EEG signals are presented in
Figure 10. They correspond to four distinct expressions, labeled from one to four, namely “Wink Left” (1), “Wink Right” (2), “Blink” (3), and “Smile” (4). Each expression is associated with values ranging from 0 to 1, which can be interpreted as percentages, representing the intensity or level of the respective expression.
To illustrate the effectiveness of our model in discerning and categorizing facial expressions based on EEG signals, we can examine expression number four, which corresponds to a smile. To ensure the accuracy and consistency of our findings, we conducted a comprehensive set of 400 measurements for each expression category (1, 2, 3, and 4). In our experiments, we obtained a precision of 0.96 on the EEG for this particular expression. The results of these measurements were consistently and accurately classified, with minimal discrepancies. This outcome provides strong evidence for the reliability of our model, confirming its ability to accurately identify and categorize facial expressions using EEG signals.
5.2.3. Fusion Command Results
The study findings, depicted in
Figure 11, demonstrated a consistent correlation among the results obtained from the webcam, the EEG signals, and their fusion in terms of accurately detecting and recognizing expressions. Additionally, the EAR (the aspect-ratio measure used for expression recognition) was also found to align consistently with the outcomes obtained from the EEG and webcam modalities. This convergence of results across multiple modalities lends further support to the robustness and reliability of our approach in capturing and interpreting emotional states.
The alignment observed between facial expressions captured by the webcam and neural activity monitored through EEG signals highlights the effectiveness of combining these two modalities in the context of controlling a wheelchair.
Confusion matrices are a valuable tool for evaluating classification models by summarizing the relationship between predicted and actual labels. They are structured as N × N tables, with one axis representing predicted labels and the other representing true labels. In a multi-class classification problem with N classes, confusion matrices provide insights into the model’s classification accuracy and reveal its strengths and weaknesses. In our classification problem, N = 4.
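For reference, a confusion matrix of this kind can be computed as in the short sketch below, assuming y_true and y_pred hold the actual and fused predicted expression labels.

```python
# Hedged sketch: building the 4-class confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

classes = ["Wink Left", "Wink Right", "Blink", "Smile"]
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)   # rows: actual classes, columns: predicted classes
```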
Figure 12 shows the confusion matrix of the fusion algorithm for our classification, where the diagonal cells represent correctly classified samples and the off-diagonal cells represent misclassified samples. It shows that, of the 400 samples of the expression Wink Left (labeled 1 in the figure), the fusion model correctly classified 393 and misclassified 7. Similarly, for Wink Right (2), Blink (3), and Smile (4), with 400 samples each, 395, 396, and 400 were correctly classified, while 5, 4, and 0 were misclassified, respectively.
5.3. Comparison with Previous Works
The accuracy of the fusion method is consistently higher than that of either the EEG signals or the webcam alone, as shown in
Figure 13.
For example, at epoch 11, the fusion accuracy reaches 0.99, whereas the accuracies of the EEG signals and the webcam are limited to 0.95 and 0.98, respectively.
The combination of facial expressions and neural signals has the potential to improve wheelchair control systems, offering intuitive interaction and precise control. This fusion of parameters enhances responsiveness and personalization, empowering individuals to navigate their surroundings with increased ease and independence.
A comparison of the proposed methodology with deep learning and other classification algorithms from the literature [
35,
36,
37,
38,
39] is presented in
Table 4. The comparison focuses on studies that were also evaluated in terms of precision, recall, accuracy, and F-score.
We have found that our proposed algorithm based on a fusion of CNN-LSTM outperforms the other works in both accuracy and precision. These results show the advantage of our networks in predicting the four emotions based on EEG signals and video recognition.
Our network architecture essentially represents a comprehensive method of emotion identification, seamlessly integrating the perceptive capabilities of image analysis with the nuanced insights derived from EEG signal processing. Late fusion creates a cooperative synergy that enables our model to overcome single-modality methods’ constraints.
6. Conclusions
This work lays the groundwork for the creation of cutting-edge assistive technology and opens up new directions for exploring the potential of human–computer interaction. By combining the power of facial expressions with EEG signals, we can gain a better understanding of human cognition and behavior, ultimately improving the quality of life of people with mobility disabilities.
In this paper, we proposed a smart wheelchair control system designed to aid individuals with physical impairments in their mobility. Our approach integrates fusion between decision modes from an EEG signals sensor and a webcam images sensor, achieving an outstanding accuracy level of up to 99% in emotion recognition. The incorporation of CNN and LSTM architectures through a fusion algorithm exhibits superior performance, surpassing single-modality methodologies. This comprehensive approach not only advances the field of human–computer interaction but also contributes to assistive technologies, thereby improving the quality of life of individuals with mobility impairments.
In future endeavors, we aim to validate our study’s findings by implementing our application on an embedded system and deploying it in real-world scenarios, particularly in the control of wheelchairs. Furthermore, we plan to introduce an additional control modality, such as voice control, expanding the versatility and practicality of our proposed system.