In this subsection, we first describe how the multi-channel EEG sequences are preprocessed, then introduce the multi-band EEG topology maps and their construction, and finally describe the proposed emotion recognition network, ERENet, in detail.
2.3.1. Multi-Band EEG Topology Maps
EEG signals are multi-channel continuous time sequences in the format “channel number × sampling time”, which many studies adopt as the input format of their models [32,35,36]. The disadvantage of this format is that electrode measurements at different spatial locations on the cerebral cortex are merely aggregated together, while the topological structure of the EEG signals is ignored. The brain is a complex system: completing a task relies on the cooperation of multiple regions, and there are intrinsic correlations among them. The spatial information of EEG signals contains significant information related to emotional states and is of great value for exploring the correlations between brain regions. To preserve the topological structure of EEG signals, the multi-channel EEG sequences are converted into 2D topology maps, so that channels that are physically adjacent remain adjacent in the 2D topology map. The international 10–20 system describes the locations of scalp electrodes; “10” and “20” mean that the actual distance between adjacent electrodes is 10% or 20% of the total front-to-back or left-to-right distance of the skull.
Figure 2a shows the spatial distribution of electrodes in the international 10–20 system, in which the electrodes marked in blue are the 32 electrodes used in the DEAP dataset. To address the loss of electrode topological location information, the 32 electrodes used in the DEAP dataset are relocated onto a 2D topological structure based on the electrode distribution of the international 10–20 system. For each sampling point, the 32-channel EEG signals are mapped to a 9 × 9 matrix, as shown in Figure 2b.
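As a concrete illustration, the per-sampling-point mapping can be sketched as follows. The grid coordinates below are illustrative placeholders for a few DEAP electrodes, not the exact layout of Figure 2b:

```python
import numpy as np

# Hypothetical (row, col) positions on the 9 x 9 grid for a few of the
# 32 DEAP electrodes; the exact layout follows Figure 2b in the paper.
GRID_POS = {
    "Fp1": (0, 3), "Fp2": (0, 5),
    "F3": (2, 2), "Fz": (2, 4), "F4": (2, 6),
    "C3": (4, 2), "Cz": (4, 4), "C4": (4, 6),
    "P3": (6, 2), "Pz": (6, 4), "P4": (6, 6),
    "O1": (8, 3), "O2": (8, 5),
}

def to_topology_map(values, positions, h=9, w=9):
    """Place one value per electrode on an h x w grid; unused cells stay 0."""
    grid = np.zeros((h, w))
    for name, value in values.items():
        row, col = positions[name]
        grid[row, col] = value
    return grid

sample = {name: 1.0 for name in GRID_POS}  # one measurement per electrode
topo = to_topology_map(sample, GRID_POS)
print(topo.shape)  # (9, 9)
```

Positions with no electrode simply keep their default value, which anticipates the zero-filling discussed below.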
At present, a number of studies have examined the activity of emotional states in different frequency bands. Alarcão et al. [42] pointed out that the alpha, beta, and gamma bands of the frontal lobe carry relatively significant information for distinguishing emotional states. In addition, Frantzidis et al. [43] observed that features in the theta band are closely related to arousal. Therefore, the theta, alpha, beta, and gamma frequency bands are selected to study the emotional-state characteristics of the human brain, and four 2D matrices are constructed to describe the significant emotion-related information in the four frequency bands. The construction of the multi-band EEG topology maps is shown in Figure 3. The multi-band EEG topology maps clearly represent the changes in EEG signals on the scalp under different emotional states and provide richer information by integrating the frequency domain features, spatial information, and frequency band characteristics of multi-channel EEG signals.
EEG signals are non-stationary and highly random, so some traditional analysis methods cannot be applied to them directly. However, if EEG signals are decomposed into several short segments, the signal within a short window can be regarded as approximately stationary. Therefore, after the samples are segmented by a sliding window, the EEG segment of each electrode can be treated as a stationary signal. A fast Fourier transform (FFT) is performed on each segment to extract the spectral power of the theta, alpha, beta, and gamma bands as frequency domain features, calculated as shown in Equation (3):
where the frequency bands of theta, alpha, beta, and gamma are 4–8 Hz, 8–13 Hz, 13–30 Hz, and 30–45 Hz, respectively. The frequency domain features extracted from the segment of each electrode are then collected across all electrodes of an EEG sample.
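A minimal sketch of this band-power extraction, assuming the 128 Hz sampling rate of the preprocessed DEAP recordings and a plain periodogram rather than the paper's exact estimator:

```python
import numpy as np

# Frequency bands (Hz) as defined above.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_powers(segment, fs=128.0):
    """Sum the FFT power spectrum of one EEG segment over each band."""
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(segment)) ** 2 / len(segment)
    return {band: psd[(freqs >= lo) & (freqs < hi)].sum()
            for band, (lo, hi) in BANDS.items()}

fs = 128
t = np.arange(fs) / fs                 # one 1 s sliding-window segment
segment = np.sin(2 * np.pi * 10 * t)   # a pure 10 Hz rhythm -> alpha band
powers = band_powers(segment, fs)
```

On this synthetic 10 Hz signal, almost all spectral power falls in the alpha band, as expected.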
According to the mapping rule described in Figure 2, the topological positions of the 32 electrodes can be obtained by relocating the electrode channels, as shown in Figure 2c. The spectral power of the theta band is mapped into a 2D matrix according to the electrode locations; the height and width of the matrix are both 9 in this paper. The remaining positions in the matrix are filled with zeros, for two reasons: the resulting matrix is better suited as model input, and the relative positions of the electrodes on the scalp can be simulated without introducing extra noise. This process is repeated for the alpha, beta, and gamma bands. Therefore, the frequency domain features of the four frequency bands extracted from each time slice are converted into four 2D topology maps through this mapping, as shown in the lower left corner of Figure 3. These topology maps serve as the input of ERENet for learning and evaluating the patterns of emotion generation.
2.3.2. The Model of Proposed ERENet
In this paper, fully considering the band information of EEG signals and the spatial information among channels, a lightweight EEG-based emotion recognition network inspired by the neuronal circuits of the human brain is proposed, as shown in Figure 4. The model includes three main modules: a multi-band discrete parallel processing module, a multi-band information exchange and reorganization module, and a weighted classification module.
- 1.
Multi-band Discrete Parallel Processing Module
The discrete parallel processing circuit architecture is an organizational mode of information presentation in the nervous system: signals are presented and processed in parallel through discrete information channels [44], as shown in Figure 5. A typical example is the olfactory bulb in the brain. Olfactory receptor neurons expressing the same odorant receptor are scattered across the olfactory epithelium but send their axons to the same glomerulus, where they synapse onto the dendrites of their corresponding second-order projection neurons, forming discrete olfactory processing channels [45,46]. In addition, discrete parallel processing is characteristic of the visual nervous system: different bipolar and ganglion cells in the retina form specific connections, and different types of visual signals, such as brightness, color, and motion, are processed in parallel. Compared with serial processing, parallel processing reduces computational depth, thereby reducing the error rate and increasing processing speed.
Different frequency bands of EEG signals correlate with emotional states to different degrees. Therefore, based on the principle of the discrete parallel processing circuit architecture, EEG signals of different frequency bands can be processed in parallel in discrete information channels. Each information channel corresponds to one frequency band, and the characteristic information specific to that band can be extracted within its channel. The design of the multi-band discrete parallel processing module is shown in the left box of Figure 4. This module is mainly responsible for extracting the feature information specific to each band and is composed of a group convolution layer, a batch normalization (BN) layer, and an activation layer; the BN and activation layers are not shown in Figure 4.
In the group convolution layer, the input data are grouped by frequency band, and each band's data are convolved with its own set of convolution kernels, with a stride of 1. The boundaries are zero-padded so that the output feature maps have the same size as the input. After convolution, each 2D topology map is encoded into a higher-level representation, as shown in Equation (4):
where each output denotes a feature map of its corresponding band. In classical CNN models, a pooling layer is usually added after the convolution layer to reduce the data dimension, which often leads to the loss of some information. In EEG-based recognition tasks, however, the EEG topology map is much smaller than the images used in computer vision, so no pooling operation is used after the convolution layer in this study, in order to preserve all information. After the group convolution layer, a BN layer is added to normalize the data distribution, which not only avoids gradient vanishing or explosion and makes the gradient more predictable and stable, but also increases the training speed. Finally, the feature representation of the module is obtained through the ReLU activation function, as shown in Equation (5):
where each element denotes the feature representation of its corresponding band. This output is passed to the next module.
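The grouped convolution can be sketched with a naive NumPy implementation; the kernel counts and sizes here are illustrative choices, not the paper's hyperparameters:

```python
import numpy as np

def group_conv2d(x, kernels, groups):
    """Naive grouped 2D convolution, stride 1, zero-padded to 'same' size.
    x: (C_in, H, W); kernels: (C_out, C_in // groups, k, k)."""
    c_in, h, w = x.shape
    c_out, c_per_group, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    filters_per_group = c_out // groups
    for o in range(c_out):
        g = o // filters_per_group                      # the band this filter serves
        xg = xp[g * c_per_group:(g + 1) * c_per_group]  # only that band's maps
        for i in range(h):
            for j in range(w):
                out[o, i, j] = (xg[:, i:i + k, j:j + k] * kernels[o]).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 9, 9))           # four band topology maps
kernels = rng.standard_normal((8, 1, 3, 3))  # two 3x3 filters per band, groups=4
y = group_conv2d(x, kernels, groups=4)
```

Because each filter only sees the maps of its own group, the four bands form discrete information channels: changing the theta map affects only the theta filters' outputs.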
- 2.
Multi-band Information Exchange and Reorganization Module
Many studies have pointed out that fusing features from different frequency bands can improve recognition accuracy [17,47,48]. Therefore, to effectively utilize the feature information of different frequency bands at the same spatial position, a multi-band information exchange and reorganization module is designed. It is responsible for exchanging and reorganizing the feature maps and for fully extracting deep features from feature map groups composed of different frequency bands, as shown in the middle box of Figure 4. This module also includes a group convolution layer, a BN layer, and an activation layer.
In the group convolution layer of this module, the feature maps produced within the same frequency band channel of the previous module are divided into different groups, so that each group contains feature maps from different frequency bands. Each group is convolved with its own set of convolution kernels, with a stride of 1, and the boundaries of the feature maps are zero-padded. The convolution result of each group is shown in Equation (6):
where each element denotes the corresponding feature map. After the convolution of the feature map groups, different fused feature maps are obtained. After the convolution operation, a BN layer is again applied to normalize the data distribution, and the final feature representation of the module is obtained through the ReLU activation layer and output to the next module, as shown in Equation (7):
where each element denotes the corresponding fused feature representation.
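The exchange step can be sketched as a ShuffleNet-style channel shuffle; whether the paper regroups the maps in exactly this way is an assumption, but the effect, each new group receiving maps from every band, is the same:

```python
import numpy as np

def exchange_bands(feats, groups):
    """Interleave feature maps so each new group mixes all frequency bands.
    feats: (C, H, W), ordered band-by-band with C = groups * maps_per_band."""
    c, h, w = feats.shape
    per_band = c // groups
    return feats.reshape(groups, per_band, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# Eight maps tagged 0..7: maps 0-1 from theta, 2-3 alpha, 4-5 beta, 6-7 gamma.
feats = np.arange(8, dtype=float)[:, None, None] * np.ones((8, 9, 9))
mixed = exchange_bands(feats, groups=4)
print([int(m[0, 0]) for m in mixed])  # [0, 2, 4, 6, 1, 3, 5, 7]
```

After the shuffle, the first four maps come from theta, alpha, beta, and gamma, respectively, so the subsequent group convolution fuses information across bands at each spatial position.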
- 3.
Weighted Classification Module
The structure of the weighted classification module, shown in the right box of Figure 4, includes a channel-weighted pooling layer, a fully connected layer, and an output layer with a Sigmoid activation function. In a traditional CNN, the fully connected layers account for a large share of the parameters in addition to the convolution layers. For example, when MobileNets [49] is applied to the ImageNet classification task, the fully connected layer takes a 1024-dimensional feature vector as input and generates 1000 probability values corresponding to the 1000 classes; this layer alone has more than one million parameters, accounting for 24.33% of the total. Therefore, channel-weighted pooling is proposed to replace this fully connected layer in the weighted classification module, which not only aggregates features but also greatly reduces the number of parameters in the classification layer.
In order to filter all the fused features and extract features further, instead of convolving each feature map from beginning to end, all feature maps are considered directly: the feature values at the same position of all feature maps are aggregated, and each position is given a learnable weight according to its importance. Let the fused features output by the previous module be a set of feature maps, each consisting of feature values at the individual positions. When aggregating the feature values at a given position across all feature maps, a learnable weight is assigned to that position, and the extracted feature can be expressed as shown in Equation (8):
which gives the feature extracted from the feature maps at that position. Therefore, channel-weighted pooling not only reduces the feature dimension but also filters the different fused features to extract more representative ones. The fully connected layer is not completely discarded: one is added between the channel-weighted pooling layer and the output layer to enhance the feature representation. The final feature vector, whose dimension equals the number of neurons in the fully connected layer, is given by Equation (9). In the output layer, the feature vector is aggregated, and the aggregated result is activated by Sigmoid to obtain the final prediction, as shown in Equation (10):
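Channel-weighted pooling can be sketched as below; whether the maps are summed or averaged before weighting is an assumption, as is the weight initialization:

```python
import numpy as np

def channel_weighted_pool(feats, weights):
    """Aggregate the values at each spatial position across all M feature maps
    and scale each position by its learnable weight (cf. Equation (8)).
    feats: (M, H, W); weights: (H, W). Returns an (H*W,) feature vector."""
    return (feats.sum(axis=0) * weights).reshape(-1)

M, H, W = 16, 9, 9
feats = np.ones((M, H, W))       # toy fused feature maps
weights = np.full((H, W), 0.5)   # learnable per-position weights
v = channel_weighted_pool(feats, weights)
print(v.shape, v[0])  # (81,) 8.0
```

Note that the layer introduces only H × W = 81 learnable weights, whereas a flatten-plus-fully-connected head would need M × H × W weights per output neuron.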
In the traditional classification layer shown in Figure 6a, the feature maps output by the previous module are flattened into a 1D vector and passed through the fully connected layer to the output layer, which determines the total number of parameters. In the weighted classification module shown in Figure 6b, the number of parameters is instead determined by the number of feature maps output by the previous module and the 9 × 9 size of the topology maps.
Although the weighted classification module merely replaces one fully connected layer with a channel-weighted pooling layer, it has only 10,784 trainable parameters, a reduction of 41,077 compared with the previous classification layer composed solely of fully connected layers. Moreover, the model with the channel-weighted pooling layer achieves better results than other models with similar parameter counts.
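The parameter comparison can be made concrete with a small counting sketch. The layer sizes below are illustrative and the bias conventions are assumptions, so the totals will not match the paper's exact figures:

```python
def params_flatten_fc(c, h, w, n):
    """Flatten (c*h*w) -> FC(n) -> output(1), counting weights and biases."""
    fc = c * h * w * n + n   # dominant term: every flattened value feeds every neuron
    out = n * 1 + 1
    return fc + out

def params_weighted(c, h, w, n):
    """Channel-weighted pooling (h*w weights) -> FC(n) -> output(1)."""
    pool = h * w             # one learnable weight per spatial position
    fc = h * w * n + n       # the pooled vector has only h*w entries
    out = n * 1 + 1
    return pool + fc + out

# Illustrative sizes: 64 fused 9x9 maps, a 128-neuron fully connected layer.
trad = params_flatten_fc(64, 9, 9, 128)
light = params_weighted(64, 9, 9, 128)
print(trad, light)
```

The flatten-based head scales with the number of feature maps, while the weighted head does not, which is where the bulk of the parameter saving comes from.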