4.1. EEG Sensitivity to Different Fact-Extraction Methods
4.1.1. Based on Cosine Similarity
This section reports the inter-class similarities of the EEG vectors corresponding to the factual word groups and non-factual word groups obtained from four factual word extraction methods and the intra-class similarities for each group. The quantities of factual and non-factual words extracted using the four methods (dependency, entity, pos, and TF-IDF) are shown in
Table 2.
Based on the description in
Section 3, each of the 1220 words was recorded with 1.5 s of EEG signal. After averaging across the 28 channels, each word corresponds to a 375-dimensional EEG vector. When computing the cosine similarity of EEG vectors for factual word groups and non-factual word groups obtained from different extraction methods, three distinct time segments of the 1.5-s EEG signal were analyzed for each extraction method. These three segments are as follows: (1) the entire 1.5-s duration, encompassing 375 sampling points, denoted as “Overall”; (2) the time window potentially capturing the N400 ERP component, ranging from 250 ms to 500 ms, which includes 62 sampling points, denoted as “N400”; and (3) the time window potentially capturing the P600 ERP component, ranging from 500 ms to 1000 ms, containing 125 sampling points, denoted as “P600”.
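The windowing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 250 Hz sampling rate is inferred from 375 samples over 1.5 s, the boundary convention (start rounded up, end rounded down) is an assumption chosen because it reproduces the reported counts of 375, 62, and 125 points, and the names `window_slice` and `word_vector` are hypothetical.

```python
import math
import numpy as np

FS_HZ = 250  # inferred: 375 samples / 1.5 s
WINDOWS_MS = {"Overall": (0, 1500), "N400": (250, 500), "P600": (500, 1000)}

def window_slice(start_ms, end_ms, fs_hz=FS_HZ):
    """Map a window in ms to sample indices.

    Assumed convention: round the start up and the end down to whole samples.
    """
    start = math.ceil(start_ms * fs_hz / 1000)
    stop = math.floor(end_ms * fs_hz / 1000)
    return slice(start, stop)

def word_vector(epoch, window):
    """Average a (channels, samples) epoch over channels, then cut the window."""
    start_ms, end_ms = WINDOWS_MS[window]
    channel_mean = epoch.mean(axis=0)
    return channel_mean[window_slice(start_ms, end_ms)]

# Synthetic stand-in for the EEG epoch of a single word (28 channels, 375 samples).
rng = np.random.default_rng(0)
epoch = rng.standard_normal((28, 375))

lengths = {name: len(word_vector(epoch, name)) for name in WINDOWS_MS}
print(lengths)  # {'Overall': 375, 'N400': 62, 'P600': 125}
```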
Firstly, the inter-class cosine similarity between the EEG vectors of fact word groups and non-fact word groups obtained from the four extraction methods was computed, with results shown in
Table 3.
It can be observed that the inter-class cosine similarity between the EEG vectors of fact word groups and non-fact word groups extracted using the Entity and Dependency methods is close to −1. This indicates that the EEG vectors of the fact and non-fact word groups identified by these two methods point in nearly opposite directions and are clearly differentiated. In contrast, the EEG signals of the two word groups extracted with the Pos method are not clearly differentiated, and the vectors even display a degree of positive correlation. Furthermore, the two groups of words extracted using TF-IDF behaved differently in the N400 (often indicative of semantic comprehension) and P600 (typically representing syntactic processing) time windows, suggesting that the TF-IDF method captures some semantic information (as evidenced by the negative similarity in the N400 time window) but overlooks crucial syntactic information (as evidenced by the positive similarity in the P600 window).
Subsequently, the intra-class similarity for factual word groups obtained from the four extraction methods was computed, with the results shown in
Table 4.
All extraction methods yielded a very low intra-class cosine similarity for the factual word group, nearing 0. The result indicates a minimal correlation between the EEG vectors corresponding to the factual words extracted by these four methods. The vectors are nearly orthogonal to one another, suggesting that the brain interprets each word within the factual word group largely independently.
Lastly, the intra-class similarity for non-factual word groups obtained from the four extraction methods was computed, with the results presented in
Table 5.
All extraction methods resulted in a very low intra-class cosine similarity for the non-factual word group, again approaching 0. The result indicates a minimal correlation between the EEG vectors corresponding to the non-factual words extracted by these four methods, and the brain’s comprehension of each word within the non-factual word set is highly independent.
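To make the two quantities concrete, the sketch below shows one way inter-class and intra-class cosine similarity could be computed. The aggregation is an assumption, since this section does not spell it out: inter-class similarity is taken here as the cosine between the two group-mean vectors, and intra-class similarity as the mean cosine over all distinct word pairs within a group. The function names and toy vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def inter_class_similarity(group_a, group_b):
    """Assumed definition: cosine between the mean vectors of two groups."""
    return cosine(group_a.mean(axis=0), group_b.mean(axis=0))

def intra_class_similarity(group):
    """Assumed definition: mean cosine over all distinct pairs within a group."""
    unit = group / np.linalg.norm(group, axis=1, keepdims=True)
    pairwise = unit @ unit.T
    upper = np.triu_indices(len(group), k=1)  # each pair once, no self-pairs
    return float(pairwise[upper].mean())

# Toy vectors (not EEG data): opposite groups and mutually orthogonal words.
group_a = np.array([[1.0, 2.0, 0.0], [3.0, 1.0, 0.0]])
group_b = -group_a
orthogonal_words = np.eye(3)

inter = inter_class_similarity(group_a, group_b)   # opposite directions
intra = intra_class_similarity(orthogonal_words)   # orthogonal vectors
```

In this toy case the inter-class value is −1 (the group means point in opposite directions) and the intra-class value is 0 (the vectors are mutually orthogonal), which are the two limiting patterns discussed above.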
4.1.2. Based on EEG Signals
In order to visually illustrate the disparities in brain activity between factual and non-factual words acquired through four extraction methods (dependency, entity, pos, TF-IDF), corresponding EEG signal curves were graphed, as depicted in
Figure 4,
Figure 5,
Figure 6 and
Figure 7.
Each of these figures corresponds to one of the four extraction techniques. The curves represent the average EEG signals over 1.5 s for 14 participants reading all factual or non-factual words. The semi-transparent regions on either side of the curves indicate the standard error (SE) of the voltage at the corresponding time points. In the legend, “N” represents the number of EEG signals used for computing the average. For instance, in
Figure 4, 7308 indicates that 14 participants read 522 factual words. For each extraction method, 4 subfigures display the voltage from different channels (Area of Interest, AOI). From left to right, these are all 28 channels (Overall), 10 channels located above the frontal lobe (Frontal), 10 channels above the central region (Central), and 8 channels above the parietal lobe (Parietal).
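The averaged curves and their SE bands can be sketched as below. This is an illustration on synthetic data, assuming that each participant-word epoch has already been averaged over the channels of one AOI and that the SE is computed across the pooled epochs; `mean_and_se` is a hypothetical helper.

```python
import numpy as np

def mean_and_se(epochs):
    """epochs: (n_epochs, n_samples). Return the mean curve, its SE, and N."""
    n_epochs = epochs.shape[0]
    mean_curve = epochs.mean(axis=0)
    # Standard error of the mean at each time point (sample std / sqrt(N)).
    se_curve = epochs.std(axis=0, ddof=1) / np.sqrt(n_epochs)
    return mean_curve, se_curve, n_epochs

n_participants = 14
n_factual_words = 522
n_expected = n_participants * n_factual_words  # the "N" shown in the legend

# Synthetic stand-in for the pooled epochs of one word group in one AOI.
rng = np.random.default_rng(0)
epochs = rng.standard_normal((n_expected, 375))

mean_curve, se_curve, n_epochs = mean_and_se(epochs)
lower_band = mean_curve - se_curve  # edges of the semi-transparent region
upper_band = mean_curve + se_curve
```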
Inspection of the EEG signal curves shows that it is difficult to discern a clear distinction between the factual and non-factual word sets derived using the “Dependency”, “Pos”, and “TF-IDF” methods, as their voltage curves do not differ visibly. However, a marked contrast is observed with the curves obtained using the “Entity” method. In all four AOIs, the EEG signal curves for the factual and non-factual word sets identified through the “Entity” method never intersect. This distinction is even more pronounced in the Overall, Frontal, and Central AOIs, where the signal curves, including their semi-transparent standard error margins, remain separate. This outcome aligns with the cosine-similarity findings above. Additionally, research has shown that human brains exhibit discernible differences in electro-cortical manifestations when processing common and proper nouns during reading tasks [
39]. Proper nouns are often extracted as factual words using the Entity method. These results further underscore that the distinction between groups of factual and non-factual words extracted through named entity recognition (entity) is most pronounced in the EEG signals.
4.1.3. Based on Global Field Power (GFP)
GFP values reflect the overall activity level of EEG signals within specific time windows [
40]. Higher GFP values typically indicate stronger brain electrical activity, while lower values denote weaker activity. In this section, we calculated the GFP values of factual and non-factual words extracted using different methods across three time windows, as shown in
Table 6.
Over the entire 1.5-s window (Overall), factual words extracted using the Entity method exhibited the highest GFP values, indicating that overall brain activity is strongest when processing these words. This might suggest that such words pose greater cognitive challenges and require more cognitive resources, or that they trigger more complex processing mechanisms in the brain. The N400 time window (250–500 ms) is associated with semantic understanding. During this phase, non-factual words extracted via the TF-IDF method elicited the strongest electrical activity, suggesting that these words stimulate more brain activity during semantic processing than other types. Factual words extracted using the Entity method showed the second-highest GFP values, indicating that these words also engage the brain to a certain extent during semantic understanding. The P600 time window (500–1000 ms) is primarily related to syntactic processing. In this stage, factual words extracted using the Entity method again exhibited the highest GFP values, suggesting that these words might have particular importance or complexity in syntactic understanding, thereby inducing stronger brain electrical activity. These findings may reveal differences in cognitive processing among different types of words and how the brain dynamically adjusts its processing strategies based on attributes of words such as factuality, which is informative for understanding the neural mechanisms of language processing.
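For reference, GFP at a time point is commonly defined as the standard deviation of the voltage across channels at that time point. The sketch below follows that definition; summarizing a window by the mean GFP over its samples is an assumption, as the aggregation used for Table 6 is not stated here, and the function names are hypothetical.

```python
import numpy as np

def global_field_power(erp):
    """erp: (channels, samples). GFP(t) is the std across channels at time t."""
    return erp.std(axis=0)  # population std (ddof=0), the usual GFP form

def mean_gfp(erp, start, stop):
    """Assumed window summary: average GFP over samples start..stop-1."""
    return float(global_field_power(erp)[start:stop].mean())

# Toy check: two channels at +1 and -1 give GFP = 1; identical channels give 0.
opposed = np.vstack([np.ones(375), -np.ones(375)])
identical = np.ones((28, 375))

gfp_opposed = mean_gfp(opposed, 63, 125)
gfp_identical = mean_gfp(identical, 63, 125)
```

A higher value therefore reflects a stronger spatial spread of potentials across the scalp, not simply a larger mean voltage.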
4.2. Correlation between the Human Brain and Models
This research aims to examine whether the methods of extracting factual words (features), the models for generating word vectors (models), and the time windows of EEG signals (TOI) significantly influence the representational similarity between human brain activity and the models across different AOIs. Therefore, a three-way repeated measures Analysis of Variance (ANOVA) was conducted for each AOI. The three independent variables are features, models, and TOIs. The features fall into four types: “dependency”, “entity”, “pos”, and “TF-IDF”; the models for generating word vectors fall into three types: “GloVe”, “BERT”, and “GPT-2”; and the time windows of the EEG signals fall into three categories: “0–1500 ms (Overall)”, “250–500 ms (N400)”, and “500–1000 ms (P600)”. Detailed statistical analyses were conducted on the representational similarity obtained for each combination to ascertain whether different features, models, and TOIs influence the representational similarity between the human brain and the models.
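The 4 × 3 × 3 within-subject design can be laid out as a long-format table with one row per participant and condition. The sketch below only illustrates that layout; it assumes the 14 participants are the repeated-measures subjects, uses placeholder scores in place of real RSA values, and the names `long_format_rows` and `rsa_lookup` are hypothetical.

```python
from itertools import product

FEATURES = ["dependency", "entity", "pos", "TF-IDF"]
MODELS = ["GloVe", "BERT", "GPT-2"]
TOIS = ["Overall", "N400", "P600"]
N_PARTICIPANTS = 14  # assumption: participants are the ANOVA subjects

# Every combination of the three within-subject factors.
conditions = list(product(FEATURES, MODELS, TOIS))

def long_format_rows(rsa_lookup):
    """Build one row per (subject, feature, model, TOI) cell for a single AOI."""
    rows = []
    for subject in range(1, N_PARTICIPANTS + 1):
        for feature, model, toi in conditions:
            rows.append({
                "subject": subject,
                "feature": feature,
                "model": model,
                "toi": toi,
                "rsa": rsa_lookup(subject, feature, model, toi),
            })
    return rows

# Placeholder scores; real values would come from the RSA step.
rows = long_format_rows(lambda subject, feature, model, toi: 0.0)
print(len(conditions), len(rows))  # 36 504
```

One such table would be built per AOI, since the ANOVA is run separately for the Overall, Frontal, Central, and Parietal channel sets.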
4.2.1. All 28 Channels (Overall)
The descriptive statistical results of the RSA scores for the human brain and model across all channels are presented in
Table A1.
Figure 8 depicts the distribution of representational similarity between human brain activity and language models at three different TOIs for various models and fact word extraction methods using boxplots. The rhombus symbols represent outliers, indicating data points that differ significantly from other observations. The error bars represent the 95% confidence interval (
CI). At the 0–1500 ms TOI, the highest RSA score of 0.00302 was observed between the human brain and the GloVe model, using the TF-IDF method for fact word extraction. At the 250–500 ms TOI, the highest RSA score reached 0.00283 for the human brain and the GloVe model, achieved with the Pos fact word extraction method. For the 500–1000 ms TOI, employing the TF-IDF method for fact word extraction yielded the highest RSA score of 0.00309 with the BERT model.
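For readers unfamiliar with RSA scores such as those reported above, the following sketch shows a common way to obtain one. It is a generic illustration, not the procedure defined in Section 3: dissimilarity is assumed to be 1 minus the Pearson correlation between word patterns, and the two dissimilarity matrices are assumed to be compared with a Spearman correlation over their upper triangles. All function names are hypothetical and the data are synthetic.

```python
import numpy as np

def rdm(patterns):
    """patterns: (words, features). Dissimilarity = 1 - Pearson r between words."""
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(matrix):
    """Return each off-diagonal pair once."""
    return matrix[np.triu_indices(matrix.shape[0], k=1)]

def ranks(values):
    """Ordinal ranks; assumes no ties, which holds for continuous-valued data."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result

def rsa_score(eeg_patterns, model_patterns):
    """Spearman correlation between the two RDMs (Pearson on the ranks)."""
    eeg_ranks = ranks(upper_triangle(rdm(eeg_patterns)))
    model_ranks = ranks(upper_triangle(rdm(model_patterns)))
    return float(np.corrcoef(eeg_ranks, model_ranks)[0, 1])

rng = np.random.default_rng(0)
eeg = rng.standard_normal((20, 375))         # 20 words x EEG samples
embeddings = rng.standard_normal((20, 300))  # 20 words x embedding dimensions

self_score = rsa_score(eeg, eeg)          # identical geometry
cross_score = rsa_score(eeg, embeddings)  # unrelated synthetic geometries
```

Because only the dissimilarity structure is compared, the EEG vectors and the word embeddings do not need to have the same dimensionality.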
Before conducting a repeated measures ANOVA on the RSA scores of the 28 channels, we first performed Mauchly’s sphericity test. For main effects and interactions that met the assumption of sphericity, we report the ANOVA with the original degrees of freedom; for effects that violated this assumption, we applied the Greenhouse–Geisser correction. On these 28 channels, the main effects of ‘Model’, ‘Feature’, and ‘TOI’, as well as the interaction between ‘Model’ and ‘Feature’, met the assumption of sphericity. The remaining interactions, including the three-way interaction, did not meet the assumption, so the Greenhouse–Geisser correction was used for them. The results of the main and interaction effects are shown in
Table A5.
For the 28 channels, the main effect results were as follows: For the 3 models, the sphericity assumption was met (p = 0.782), with an ANOVA result of F(2) = 0.334, p = 0.719, indicating that the main effect of the chosen model categories on RSA scores was not significant. For the four extraction methods, the sphericity assumption was met (p = 0.98) with an ANOVA result of F(3) = 0.701, p = 0.557, suggesting that the main effect of the chosen fact word extraction methods on RSA scores was not significant. For the TOIs, the sphericity assumption was met (p = 0.155) with an ANOVA result of F(2) = 0.334, p = 0.719, indicating that the main effect of the chosen periods of interest on RSA scores was not significant.
The results for the interactions were as follows: For the interaction between the model and extraction method, the sphericity assumption was met (p = 0.970) with an ANOVA result of F(6) = 0.68, p = 0.666, suggesting that the interaction between the two factors was not significant. For the interaction between the models and TOIs, the sphericity assumption was not met (p = 0.046), so the Greenhouse–Geisser correction was applied, resulting in F(2.338) = 0.224, p = 0.833, indicating a non-significant interaction. For the interaction between the extraction method and TOI, the sphericity assumption was not met (p = 0.044), so the Greenhouse–Geisser correction was applied, resulting in F(3.112) = 0.353, p = 0.794, indicating a non-significant interaction. For the three-way interaction, the sphericity assumption was not met (p < 0.001). After applying the Greenhouse–Geisser correction, the result was F(4.336) = 2.253, p = 0.070, suggesting that the interaction among the three factors was not significant.
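The fractional degrees of freedom above arise because the Greenhouse–Geisser correction multiplies the uncorrected degrees of freedom by an estimate ε between 1/(k − 1) and 1. The sketch below computes ε for a single within-subject factor with k levels from the double-centered covariance matrix, on synthetic data; it is not the authors' analysis code, and the function names are hypothetical. For an interaction, the same idea is applied to the interaction contrasts and is not covered here.

```python
import numpy as np

def greenhouse_geisser_epsilon(scores):
    """scores: (subjects, levels) for one within-subject factor."""
    k = scores.shape[1]
    cov = np.cov(scores, rowvar=False)            # k x k covariance of levels
    centering = np.eye(k) - np.ones((k, k)) / k
    centered = centering @ cov @ centering        # double-centered covariance
    return float(np.trace(centered) ** 2 / ((k - 1) * np.sum(centered ** 2)))

def corrected_df(epsilon, df):
    """Greenhouse-Geisser corrected degrees of freedom."""
    return epsilon * df

# Synthetic scores: 14 subjects x 3 levels (for example, three TOIs).
rng = np.random.default_rng(0)
scores = rng.standard_normal((14, 3))
epsilon = greenhouse_geisser_epsilon(scores)
df_corrected = corrected_df(epsilon, df=2)

# Reading the text backwards: the model x TOI interaction has (3-1)*(3-1) = 4
# uncorrected degrees of freedom, so the reported F(2.338) implies this epsilon.
implied_epsilon = 2.338 / 4
```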
4.2.2. Ten Channels Located above the Frontal Lobe (Frontal)
Ten of the 28 previously selected EEG channels are located above the frontal lobe. The descriptive statistical results of the RSA scores for the brain and model based on these 10 EEG channels are presented in
Table A2.
Figure 9 depicts the distribution of representational similarity between human brain activity and language models at three different TOIs for various models and fact word extraction methods using boxplots. The rhombus symbols represent outliers, indicating data points that differ significantly from other observations. The error bars represent the 95%
CI. At the 0–1500 ms TOI, the highest RSA score of 0.00435 was observed between the human brain and the GloVe model, using the TF-IDF method for fact word extraction. At the 250–500 ms TOI, the highest RSA score reached 0.00361 for the human brain and the GloVe model, achieved with the Dependency fact word extraction method. For the 500–1000 ms TOI, employing the Entity method for fact word extraction yielded the highest RSA score of 0.00401 with the GloVe model.
The results for the main and interaction effects from the repeated measures ANOVA are presented in
Table A6. For the 10 channels above the frontal lobe, the main effect results were as follows: For the 3 models, the sphericity assumption was not met (
p = 0.022). After applying the Greenhouse–Geisser correction, the result was
F(1.36) = 5.301,
p = 0.025, indicating a significant main effect of the selected three model categories on the RSA scores. The sphericity assumption was met for the four extraction methods (
p = 0.865), resulting in
F(3) = 0.851,
p = 0.475, indicating that the main effect of the selected fact-word extraction methods on the RSA scores was not significant. For the TOI, the sphericity assumption was met (
p = 0.647), resulting in
F(2) = 3.706,
p = 0.038, indicating a significant main effect of the selected TOIs on the RSA scores.
Interaction results were as follows: For the interaction between models and extraction methods, the sphericity assumption was met (p = 0.073) with F(6) = 2.139, p = 0.058, indicating no significant interaction. For the interaction between models and TOI, the sphericity assumption was not met (p = 0.039), and after applying the Greenhouse–Geisser correction, the result was F(2.418) = 0.651, p = 0.556, indicating no significant interaction. The sphericity assumption was met for the interaction between the extraction method and TOI (p = 0.507), resulting in F(6) = 1.519, p = 0.183, indicating no significant interaction. The sphericity assumption was not met for the interaction among the three factors (p < 0.001). After applying the Greenhouse–Geisser correction, the result was F(5.054) = 1.081, p = 0.379, indicating no significant interaction.
Given the significant main effects of model and TOI on the RSA scores, post-hoc tests were conducted on these two factors. The pairwise comparison results after the Bonferroni correction are presented in
Table 7 and
Table 8.
From the perspective of the 10 channels above the frontal lobe, although the ANOVA results indicated significant differences among the 3 models, the post-hoc test results suggested no significant differences between any 2 of the 3 models.
Regarding the 3 TOIs, post-hoc test results indicated that the RSA score for the 0–1500 ms interval was significantly lower than that for the 250–500 ms interval (p = 0.041), but did not differ significantly from that for the 500–1000 ms interval (p = 0.772). Additionally, there was no significant difference between the RSA scores for the 250–500 ms and 500–1000 ms intervals (p = 0.448).
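The Bonferroni correction used in these post-hoc tests multiplies each raw pairwise p-value by the number of comparisons and caps the result at 1. The sketch below illustrates this for the three pairwise comparisons among three TOIs; the raw p-values are hypothetical and chosen only for illustration, not taken from the study.

```python
from itertools import combinations

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]

TOIS = ["Overall", "N400", "P600"]
pairs = list(combinations(TOIS, 2))  # three pairwise comparisons

# Hypothetical raw p-values, one per pair (not from the study).
raw_p = [0.01, 0.40, 0.20]
adjusted_p = bonferroni(raw_p)

for (first, second), p_adj in zip(pairs, adjusted_p):
    print(f"{first} vs {second}: adjusted p = {p_adj:.3f}")
```

Because the adjustment is conservative, an omnibus ANOVA effect can be significant while none of the corrected pairwise comparisons is, as observed for the three models in the frontal channels.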
4.2.3. Ten Channels Located above the Central Region (Central)
Ten of the 28 previously selected EEG channels are located above the central region. The descriptive statistics for the RSA scores of the brain and the model on these 10 EEG channels are shown in
Table A3.
Figure 10 depicts the distribution of representational similarity between human brain activity and language models at three different TOIs for various models and fact word extraction methods using boxplots. The rhombus symbols represent outliers, indicating data points that differ significantly from other observations. The error bars represent the 95%
CI. At the 0–1500 ms TOI, the highest RSA score of 0.00340 was observed between the human brain and the BERT model, using the Pos method for fact word extraction. At the 250–500 ms TOI, the highest RSA score reached 0.00346 for the human brain and the GPT-2 model, achieved with the Pos fact word extraction method. For the 500–1000 ms TOI, employing the TF-IDF method for fact word extraction yielded the highest RSA score of 0.00318 with the GloVe model.
The results for the main and interaction effects from the repeated measures ANOVA are presented in
Table A7. For the 10 channels located above the central region, the results for the main effects analysis were as follows: For the 3 models, the sphericity assumption was met (
p = 0.321), resulting in an ANOVA outcome of
F(2) = 0.048,
p = 0.953, indicating that the main effect of the chosen model categories on RSA scores was not significant. For the four extraction methods, the sphericity assumption was not met (
p = 0.014), so the Greenhouse–Geisser correction was applied, resulting in a corrected ANOVA of
F(1.709) = 0.166,
p = 0.816, indicating that the main effect of the chosen fact word extraction methods on RSA scores was not significant. For the TOIs, the sphericity assumption was not met (
p = 0.031), leading to a corrected ANOVA result of
F(1.389) = 4.140,
p = 0.046, suggesting a significant main effect of the chosen TOIs on RSA scores.
The interaction between the model and extraction method met the sphericity assumption (
p = 0.073), with an ANOVA result of
F(6) = 0.68,
p = 0.773, indicating no significant interaction. For the interaction between the model and TOI, the sphericity assumption was not met (
p = 0.003), so the Greenhouse–Geisser correction was applied, yielding a corrected ANOVA of
F(1.948) = 1.223,
p = 0.310, indicating no significant interaction. The sphericity assumption was violated for the interaction between the extraction method and TOI (
p = 0.005), leading to a corrected ANOVA result of
F(3.588) = 0.395,
p = 0.791, indicating no significant interaction. For the three-way interaction, the sphericity assumption was not met (
p = 0.020), resulting in a corrected ANOVA of
F(4.308) = 2.150,
p = 0.082, indicating no significant interaction. Given the significant main effect of the TOIs on the RSA scores, post-hoc tests were conducted with Bonferroni correction, and the pairwise comparison results are shown in
Table 9.
For the three TOIs, pairwise differences were not significant. Numerically, however, the RSA score for the N400 window was higher than those for the Overall and P600 windows.
4.2.4. Eight Channels Located above the Parietal Lobe (Parietal)
Eight of the 28 previously selected EEG channels are located above the parietal lobe. The descriptive statistics for the RSA scores of the brain and the model on these 8 EEG channels are shown in
Table A4.
Figure 11 depicts the distribution of representational similarity between human brain activity and language models at three different TOIs for various models and fact word extraction methods using boxplots. The rhombus symbols represent outliers, indicating data points that differ significantly from other observations. The error bars represent the 95%
CI. At the 0–1500 ms TOI, the highest RSA score of 0.00393 was observed between the human brain and the GloVe model, using the TF-IDF method for fact word extraction. At the 250–500 ms TOI, the highest RSA score reached 0.00308 for the human brain and the GPT-2 model, achieved with the Pos fact word extraction method. For the 500–1000 ms TOI, employing the Entity method for fact word extraction yielded the highest RSA score of 0.00407 with the GloVe model.
The results for the main and interaction effects from the repeated measures ANOVA are presented in
Table A8. For the eight channels located above the parietal lobe, the main effects analysis results were as follows: For the three models, the sphericity assumption was not met (
p = 0.012), leading to the application of the Greenhouse–Geisser correction, yielding a corrected ANOVA of
F(1.316) = 8.35,
p = 0.007, indicating a significant main effect of the model categories chosen on the RSA scores. For the four extraction methods, the sphericity assumption was not met (
p = 0.021), necessitating the Greenhouse–Geisser correction, with the corrected ANOVA result being
F(1.875) = 0.415,
p = 0.652, implying the chosen fact word extraction methods did not have a significant main effect on RSA scores. The sphericity assumption was met for the TOIs (
p = 0.156), resulting in an ANOVA outcome of
F(2) = 3.584,
p = 0.042, suggesting a significant main effect of the selected TOI on RSA scores.
The interaction between the model and extraction method did not meet the sphericity assumption (p = 0.033), leading to the Greenhouse–Geisser correction and a corrected ANOVA result of F(3.468) = 2.012, p = 0.118, indicating no significant interaction. The interaction between the model and TOI did not meet the sphericity assumption (p = 0.027), leading to a corrected ANOVA of F(2.609) = 1.706, p = 0.189 after applying the Greenhouse–Geisser correction, indicating no significant interaction. The interaction between the extraction method and TOI did not meet the sphericity assumption (p = 0.043), resulting in a corrected ANOVA of F(3.382) = 1.066, p = 0.378 after the Greenhouse–Geisser correction, again indicating no significant interaction. For the three-way interaction, the sphericity assumption was met (p = 0.051), leading to an ANOVA result of F(12) = 1.071, p = 0.388, implying no significant interaction.
Given the significant main effects of model and TOI on the RSA scores, post-hoc tests were conducted using the Bonferroni correction, with pairwise comparison results presented in
Table 10 and
Table 11.
From the perspective of the eight channels above the parietal lobe, post-hoc tests for the three models indicated that while there were no significant differences between the GloVe model and the BERT (p = 0.054) or GPT-2 models (p = 1), the RSA scores for the BERT model with the brain were significantly lower than those for the GPT-2 model with the brain (p = 0.009).
Regarding the three TOIs, pairwise differences were not significant.
4.2.5. A Brief Summary
Our study conducted a detailed analysis of representational similarity across four AOIs: Overall, Frontal, Central, and Parietal. Each AOI corresponded to a specific set of EEG channels. We evaluated the similarity in brain and model reading activities under varying conditions, encompassing four extraction methods, three models, and three TOIs, as detailed in
Table 12. Cells in the table marked with an asterisk (*) signify significant main or interaction effects from the repeated measures ANOVA, with ‘ns’ indicating no significant effects.
Our findings showed no significant interaction effects among extraction method, model, and TOI in any AOI. This suggests that where these factors affect the RSA scores, they do so independently of one another: the effect of one factor on the correlation between human brain activity and the models does not depend on the levels of the other two.
Regarding TOIs, different intervals correspond to distinct stages of linguistic processing in the brain. For example, the N400 is typically linked with semantic violations or unexpected words, and the 250–500 ms TOI is often used to examine it, so this TOI is taken to encompass semantic interpretation. Results from the channels above the frontal lobe indicated higher RSA scores during the TOI that includes the N400 than over the overall duration. This suggests that the models captured information related to semantic processing to some extent and may be more sensitive to semantic information than to other types, such as syntax or background knowledge.
Regarding the channels above the parietal lobe, RSA scores for GPT-2 were significantly higher than those for BERT. This could imply that GPT-2’s processing strategies align more closely with the linguistic processing patterns of this specific brain region. From a model perspective, unlike BERT’s bidirectional masked language model approach, GPT-2’s unidirectional autoregressive model might more closely mirror participants’ sequential reading of the text: participants could predict the next word but could not see it in advance, akin to GPT-2’s processing style.