1. Introduction
The interaction between humans and artificial intelligence (AI) still lacks the level of engagement and synchronization that characterizes interactions between humans. The primary goal of the WithMe project (a research project funded by the Research Foundation Flanders (FWO); more information can be found at https://researchportal.be/en/project/withme-making-human-artificial-intelligence-interactions-more-entraining-and-engaging, accessed on 1 November 2023) is to thoroughly study the processes that occur in the human brain during joint activities with another individual, such as working towards shared objectives [1]. The brain signals collected in this study were primarily indicative of attention but also of emotion and reward. The purpose of this research was to determine relevant electroencephalography (EEG) features indicative of attention using machine learning (ML).
To this end, a specific experiment was designed. Temporal audiovisual integration and the support of visual attention by sound were well demonstrated in the pip-and-pop experiment [2]. The pip-and-pop experiment is based on a visual search, which does not lead to a strong visually evoked potential. Moreover, as we expected that the rhythmic presentation of target stimuli also affects working memory, the task was replaced in our experiment with a modified digit-span task, in which five target digits had to be remembered and reported [1]. This task involves visual attention, working memory, and sequence recall. To investigate the role of attention, we directly measured brain activation by means of EEG. Specifically, event-related potentials (ERPs) have been shown to be excellent tools for studying attention [3,4]. Risto Näätänen pioneered this domain with his studies of the connection between ERPs and attention, which led to the discovery of the (auditory) mismatch negativity ERP [5,6,7,8]. Additionally, research has shown that the amplitude of the P300 is directly related to the amount of attentional resources available for stimulus processing [8,9,10,11]. The P300 ERP is elicited by deviant stimuli in a sequence of standard stimuli, where the deviant stimuli are in some way more relevant to the presented task [12,13,14]. In our experiment, we thus expected the targets to elicit a P300 ERP. Research has shown that the P300 actually consists of two subcomponents: P3a and P3b [15]. The P3a generally reaches its peak around 250 ms to 280 ms after a stimulus and is associated with attention-related brain activity [16]. The P3b peak, on the other hand, can vary in latency between 300 ms and 500 ms post-stimulus [15]. The P3b is elicited by improbable events, provided that the improbable event is somehow relevant to the task at hand [17]. In our experimental setting, we expected to elicit a P3a, as the target stimuli were not scarce (approximately 50% targets and 50% distractors) and our experiment was designed to evoke attention. We did not expect to elicit a P3a for distractors, as subjects should not pay attention to them.
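To make this expected ERP concrete, the following is a minimal sketch of how stimulus-locked epochs and evoked responses could be inspected with the MNE-Python library. The file name, event marker codes, and channel pick are hypothetical placeholders, not the actual WithMe recording pipeline.

```python
import mne

# Hypothetical file name and marker codes; the actual WithMe recordings
# and event encoding may differ.
raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
raw.filter(l_freq=0.5, h_freq=30.0)  # band-pass typical for ERP analysis

events = mne.find_events(raw)  # assumes a stimulus trigger channel exists
event_id = {"target": 1, "distractor": 2}

# Epoch from 200 ms before to 800 ms after each stimulus, with
# baseline correction on the pre-stimulus interval.
epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=-0.2, tmax=0.8, baseline=(None, 0), preload=True)

# Averaging over trials should reveal a P3a around 250-280 ms
# for targets but not for distractors.
mne.viz.plot_compare_evokeds(
    {"target": epochs["target"].average(),
     "distractor": epochs["distractor"].average()},
    picks="Pz",
)
```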
The goal of this work was to accurately classify whether a target or distractor stimulus was presented to the subject based on the subject’s EEG data. For this purpose, we applied different existing ML methods to classify EEG data and investigated which method performed best on our specific use case. As we expected to elicit attention when a target was shown (and not when a distractor was shown), the trained ML model was effectively an attention detector. We expected the attention to manifest itself in the form of a P3a ERP, and we therefore expected the model to base its predictions on the presence of a P3a peak. Detecting P3a signals and, more broadly, P300 signals has a wide range of applications [18,19], particularly in P300-based brain computer interfaces (BCIs) [20], for example, in spellers [21,22,23] and intelligent home control systems [24,25]. These applications can be of great help for patients suffering from amyotrophic lateral sclerosis (ALS) or spinocerebellar ataxia, as they can enable them to communicate in a daily environment [21,23,26,27]. In the literature, a wide array of techniques has been reported for classifying and detecting the P300 [28]. Some techniques rely on a data transformation and subsequently use logistic regression to classify the transformed data, for example, xDAWN + RG [29,30,31,32]. Recently, deep learning approaches, primarily based on convolutional neural networks (CNNs), for example, EEGNet [33,34,35], have also gained in popularity [36,37,38]. Finally, as EEG data are essentially heavily correlated multivariate time series, standard time series classification techniques can be applied as well [39,40,41]. A sketch of two such pipelines is given below.
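To illustrate, here is a minimal sketch of two such pipelines: an xDAWN + RG pipeline built with pyriemann and scikit-learn, and a MiniRocket pipeline built with sktime. The hyperparameters and data shapes are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from pyriemann.estimation import XdawnCovariances
from pyriemann.tangentspace import TangentSpace
from sktime.transformations.panel.rocket import MiniRocket

# X: epoched EEG of shape (n_epochs, n_channels, n_times);
# y: 0 = distractor, 1 = target. Random data stands in for real epochs.
X = np.random.randn(100, 32, 200)
y = np.random.randint(0, 2, size=100)

# xDAWN spatial filtering -> Riemannian tangent-space mapping -> logistic regression.
xdawn_rg = make_pipeline(
    XdawnCovariances(nfilter=4),
    TangentSpace(metric="riemann"),
    LogisticRegression(max_iter=1000),
)

# MiniRocket random-convolution features -> ridge classifier.
minirocket = make_pipeline(
    MiniRocket(),
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
)

for name, clf in [("xDAWN + RG", xdawn_rg), ("MiniRocket", minirocket)]:
    clf.fit(X[:80], y[:80])
    print(name, "accuracy:", clf.score(X[80:], y[80:]))
```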
Building BCIs that are trained on multiple subjects and generalize well to previously unseen subjects holds significant value [42]. Indeed, BCIs often need to be retrained or at least calibrated for the end user [43], which is a costly and user-unfriendly process [44,45]. However, due to intersubject variability in EEG data, training models that generalize to multiple subjects (cross-subject (CS) models) is a harder task than training models for one subject (individual subject (IS) models) [44,45,46]. For this reason, we also investigated the hypothesized drop in performance when transitioning from IS to CS models. Additionally, the ML models should be able to make predictions in real time, as this is essential in real-world BCI applications.
Finally, we analyzed which EEG channels and time points were used by our models to make their predictions and checked whether these align with the expected P3a attention signature. However, ML models such as CNNs are considered “black boxes”, as no clear explanation exists for the decisions made by these models [47]. The rapidly emerging and improving field of explainable AI (xAI) aims to tackle this issue by providing insights into ML models’ decision-making processes. xAI techniques that are often used to gain insights into EEG classification models include local interpretable model-agnostic explanations (LIME) [48,49], DeepLIFT [33,50,51], and saliency maps [52,53,54], among others.
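As an illustration, a gradient-based saliency map for a trained CNN classifier can be computed in a few lines. The sketch below assumes a PyTorch model that maps a batch of epochs to two class logits; this is an assumption for illustration, not the exact EEGNet implementation used here.

```python
import torch

def saliency_map(model: torch.nn.Module, epoch: torch.Tensor) -> torch.Tensor:
    """Return |d logit_target / d input| for a single EEG epoch.

    epoch: tensor of shape (n_channels, n_times); the model is assumed
    to take a batch of shape (batch, 1, n_channels, n_times) and to
    return one logit per class.
    """
    model.eval()
    x = epoch.unsqueeze(0).unsqueeze(0).clone().requires_grad_(True)
    logit = model(x)[0, 1]          # logit of the "target" class
    logit.backward()                # gradients w.r.t. the input
    return x.grad.abs().squeeze()   # saliency, shape (n_channels, n_times)
```

Averaging such maps over many correctly classified target epochs highlights which channels and latencies drive the decision, which is how a P3a-like signature can be checked against the model’s behavior.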
In summary, we aimed to enhance the interaction between humans and AI and designed a novel experiment for this purpose. Specifically, we considered building ML models to recognize targets shown to a subject, which equates to creating an attention detector. These models should ideally generalize well to previously unseen subjects. The primary contributions of this study are:
Training of state-of-the-art classification methods to accurately predict target and distractor stimuli based on EEG data.
Analysis of the performance difference between IS and CS models.
Investigation into which EEG channels and time points were important for the model predictions, using xAI.
Ultimately, the contributions of this research collectively advance our understanding of human–AI interaction and will aid in the development of more effective BCIs and their associated applications.
The remainder of this paper is structured as follows: Section 2.1 introduces the WithMe experiment and dataset, while Section 2.2 explains the data preprocessing routine. Section 2.3 describes the classification problems and the classification methods used in this study. Section 3 presents the results and provides an in-depth analysis of the best-performing model, together with an extensive discussion of the achieved results. Finally, in Section 4, we draw conclusions and provide possible directions for future research.
4. Conclusions and Future Work
The WithMe project has led to the collection of a large, novel EEG dataset that can be used to create ML methods that automatically detect attention using P3a ERPs in single-trial data. This is of great importance to BCIs, as they often rely on the P3a or, more broadly, the P300 ERP and have a wide range of applications.
We achieved the goal of this study, which was to classify target and distractor stimuli based on the subject’s EEG data. To this end, we studied four classification methods that differed significantly in origin and complexity. We investigated the performance of these methods both as IS and CS models, with the latter being the most practically relevant due to their generalization capabilities. For the IS models, xDAWN + RG and EEGNet obtained an accuracy of 76%, outperforming MiniRocket and Rocket. While EEGNet obtained the same accuracy of 76% in the CS case, the accuracy of xDAWN + RG dropped to 73%. We attribute this difference to the larger complexity of EEGNet, which likely enables it to generalize better to previously unseen subjects. The drop in performance between IS and CS models was not as pronounced as we expected and was even nonexistent for EEGNet. We attribute this to the fact that the CS models had approximately 42 times more training data available. The EEGNet CS model performed slightly better on samples recorded under conditions Con3 and Con4, the conditions that included auditory support. While EEGNet achieved the best performance overall, it also had the highest model complexity (the highest number of trainable parameters) and took the longest time and the most computing resources to train. However, all four models were able to make predictions in real time. This property is essential for real-world human–AI interaction experiments and applications.
Finally, the application of xAI enabled us to investigate which EEG channels and time points were used by the otherwise black box EEGNet CS model to make its predictions. Indeed, using saliency maps, we concluded that the model primarily based its prediction on the values of the electrodes in the parietal-occipital region between 200 ms and 300 ms after the stimulus. This is in line with our hypotheses, as we expected to elicit an attention-related P3a ERP in the parietal-occipital region of the brain when the subject saw a target digit.
In conclusion, we achieved the goal of accurately classifying targets and distractors based on a subject’s EEG data. At the same time, our work contributes to the development of more effective BCIs and their applications. Finally, we validated the EEG data collected in the WithMe experiment.
While this study provides valuable insights into attention detection using EEG data, it is important to acknowledge some limitations. For example, as mentioned in Section 2.3, part of the data used to train the models was labeled incorrectly, as the ground truth labels were based on the predefined labels of the experiment rather than the subject’s perceived class. A possible solution is to limit the data to samples where the entire sequence was reported correctly. However, this means that we would lose a lot of data, which would in turn decrease the performance of the models. Alternatively, we could remove all “bad sequences”, where a bad sequence is defined as a sequence in which none of the targets were remembered correctly. Such sequences could be caused either by incorrectly identifying the stimuli or by failures of memory despite correctly identifying the targets and distractors. However, the number of answers that did not include at least one of the target digits (regardless of its place in the sequence) is negligible. A sketch of this filtering rule is given below.
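As a concrete illustration of this filtering rule, the following sketch drops all bad sequences from a behavioral log. The file name and column names are hypothetical, not the actual WithMe data format.

```python
import pandas as pd

# Hypothetical log: one row per sequence, with the five presented target
# digits and the digits reported by the subject, space-separated.
log = pd.read_csv("responses.csv")  # columns: seq_id, targets, reported

def is_bad_sequence(row: pd.Series) -> bool:
    """True if none of the presented targets appears anywhere in the answer."""
    targets = set(row["targets"].split())
    reported = set(row["reported"].split())
    return not (targets & reported)

# Keep only sequences in which at least one target was reported.
clean_log = log[~log.apply(is_bad_sequence, axis=1)]
```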
In future work, an experiment dedicated to attention should be used to circumvent the limitations regarding bad labels described above. This would allow for labels that exactly correspond to the subject’s perception of a stimulus, which would in turn lead to more accurate attention detectors. The ultimate goal could then be to use this attention detector in a BCI to detect whether a subject paid attention. In case they did not, the BCI could repeat the sequence or stimulus to make sure that the subject can act accordingly. This could also improve learning systems, that is, systems that know whether a student actually paid attention to the provided information [75,76]. Regarding the training and optimization of ML models, it would be interesting to include an exhaustive feature selection procedure to allow the ML model to focus on the (most) relevant features. Additionally, we want to explore other ways to enable CS generalization, for example, using transfer learning [77,78]; a minimal sketch is given below. This could further increase the generalization performance of all methods. In particular, it has the potential to elevate the performance of lightweight models such as xDAWN + RG to that of the computationally expensive EEGNet. While this work focuses on the detection of attention using epoched EEG data, the experiment can also be used to study working memory [1]. Indeed, the complete sequence EEG data should permit an investigation of working memory and whether it is influenced by auditory and/or rhythmic support.
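As an example of what such a transfer learning scheme might look like, the sketch below fine-tunes only the final classification layer of a pretrained CS model on a small calibration set from a new subject. The attribute name classifier is a placeholder for whatever head the chosen architecture actually uses.

```python
import torch

def fine_tune(model: torch.nn.Module,
              loader: torch.utils.data.DataLoader,
              epochs: int = 5, lr: float = 1e-3) -> torch.nn.Module:
    """Subject-specific fine-tuning: freeze the feature extractor and
    retrain only the classification head on the new subject's data."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.classifier.parameters():  # 'classifier' is a placeholder name
        param.requires_grad = True

    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:  # small calibration set from the new subject
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```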